Swish Analytics: NFL Data Scientist Take Home Assessment

Author: Hunter Lybbert

Contact Information


Go to Model Training

Outline

  1. Summary
  2. Brainstorm
  3. Setup
  4. Exploratory Data Analysis (EDA)
  5. Modeling

Summary

In this notebook we analyze historical NFL game data and attempt to predict the probability that the next pass play will result in a sack. A sack can be a pivotal moment in an NFL game: it can shift the momentum entirely, or it can be the nail in the coffin. There is therefore strong motivation to predict the likelihood of a sack, for the offense, for the defense, and for the viewer's experience, which is the relevant case for Swish Analytics since they provide data for sports betting platforms.

The data provided includes play-by-play information for all games in the 2021-2023 seasons. Additional metadata covers team rosters for those same years, all NFL players of all time, depth charts, playing time information, and advanced stats for defensive players, rushers, and passers.

Our analysis goes as follows:

  1. Assess the data
  2. Build several predictive models
  3. Compare and evaluate their performance against one another

Below you will find the exploratory data analysis including plots and graphs, feature engineering, and code assembled in one place.

The following data dictionaries were crucial in helping with the analysis:

  • Dictionary
  • Depth Chart Dictionary
  • Snap Counts Dictionary
  • Dictionary Rosters

[Image: a sack of Tom Brady]

Depicted above is a nice, hard-hitting sack of Tom Brady.

Brainstorm

My initial ideas for data that would be helpful in predicting the likelihood of a sack:

  • Historical number of sacks for each player on the defense
    • You could also weight this by position, giving a higher weight to defensive linemen and a lower weight to strong safeties and corners (who are sometimes included in a blitz)
  • Sacks allowed by the offensive linemen
    • This would be telling if certain offensive linemen simply tend to allow more sacks, but I wouldn't expect it to be a big factor
  • The number of times the particular quarterback in play has been sacked
  • Down and distance
    • A sack must occur on a passing play or intended passing play
    • I would expect more passes on later downs, but depending on the score differential and the time of the game, passes could become more likely on earlier downs
  • Field position
    • I am not exactly sure, but I believe a sack would be unlikely when the offense is behind its own 15-20 yard line
  • Score of the game
    • A larger differential would make the trailing team more likely to pass in desperate situations, which could lead to more sacks
  • Timestamp of the game
    • More desperation later in the game could result in more sacks, though I am not sure
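
Several of the ideas above (historical sacks by the defense, sacks allowed, times a quarterback has been sacked) amount to "how often has this happened before this play". The following is a minimal sketch of one way such a feature could be computed without leaking the current play's outcome. The column names `defteam`, `qb_dropback`, and `sack` appear in the play-by-play data, but the toy values and the helper `add_prior_sack_rate` are hypothetical, and the rows are assumed to already be in chronological order.

```python
import pandas as pd

# Toy play-by-play rows (made-up values); only dropbacks are included.
plays = pd.DataFrame({
    "defteam":     ["SEA", "SEA", "SEA", "SEA", "GB", "GB", "GB"],
    "qb_dropback": [1, 1, 1, 1, 1, 1, 1],
    "sack":        [0, 1, 0, 1, 0, 0, 1],
})


def add_prior_sack_rate(df: pd.DataFrame, key: str) -> pd.DataFrame:
    """Add the sack rate seen on earlier plays for each value of `key`.

    Hypothetical helper. Only plays strictly before the current row are
    counted, so the feature never includes the label it helps predict.
    """
    out = df.copy()
    grouped = out.groupby(key)["sack"]
    prior_sacks = grouped.cumsum() - out["sack"]  # sacks before this play
    prior_plays = grouped.cumcount()              # plays before this play
    # The first play of each group has no history, so its rate is NaN.
    out[f"{key}_prior_sack_rate"] = prior_sacks / prior_plays.where(prior_plays > 0)
    return out


features = add_prior_sack_rate(plays, "defteam")
print(features)
```

The same helper could be keyed on a passer or offense column instead of `defteam`; how to fill the missing first-play values (for example with a league-wide rate) is a separate modeling choice.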

For convenience, let's establish a bit of notation around the probability distribution that we want to estimate. Let $N$ be the total number of pass plays in our dataset and let $S$ be the total number of sacks in that same dataset. Additionally let $P(X=0|\rho, \theta)$ be the probability

Models to Try:

  • Purely based on historical information
    • For example, the most naive model, letting $N$ be the total number of pass plays and $S$ be the total number of sacks, would be
      • Likelihood = (number of pass plays resulting in a sack)/(number of total pass plays) = $S/N$
    • Then start making it more complicated a little at a time
      • Likelihood = (number of plays resulting in a sack | this team is on defense)/(number of total plays | this team is on defense)
      • You could furthermore identify it with given information about which players are currently on defense
  • Train a model
    • Random forest regressor
      • Does this work when building a distribution?
      • We don’t just want to predict whether it’s a sack or not with a certain accuracy. We want to determine the odds it will happen
  • Can deep learning help?
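
The purely historical models above can be sketched in a few lines of pandas. This is an illustration on made-up rows, not the notebook's actual computation: it assumes that `qb_dropback == 1` is a reasonable filter for "pass play or intended pass play" and that `sack` is a 0/1 indicator, both of which should be checked against the data dictionary.

```python
import pandas as pd

# Toy play-by-play rows (made-up values).
plays = pd.DataFrame({
    "defteam":     ["SEA", "SEA", "SEA", "SEA", "GB", "GB", "GB", "GB"],
    "qb_dropback": [1, 1, 0, 1, 1, 1, 1, 0],
    "sack":        [0, 1, 0, 1, 0, 0, 1, 0],
})

# Keep only pass plays (assumption: dropbacks stand in for intended passes).
pass_plays = plays[plays["qb_dropback"] == 1]

# Most naive model: one league-wide rate, S / N.
N = len(pass_plays)
S = int(pass_plays["sack"].sum())
league_rate = S / N

# Slightly less naive: condition on which team is on defense.
team_rate = pass_plays.groupby("defteam")["sack"].mean()

print(f"N={N}, S={S}, league sack rate={league_rate:.3f}")
print(team_rate)
```

With few plays per group the conditional rates become noisy, which is one motivation for moving from raw frequencies to a trained model.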

Setup

In [1]:
from typing import Optional

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils.class_weight import compute_sample_weight

from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    confusion_matrix,
    ConfusionMatrixDisplay,
    PrecisionRecallDisplay,
    f1_score,
    recall_score,
    accuracy_score,
    precision_score,
)

from xgboost import XGBClassifier

import altair as alt
import matplotlib.pyplot as plt


alt.data_transformers.disable_max_rows()
alt.renderers.enable('default')
pd.set_option('display.max_columns', None)
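
Since the goal is a probability rather than a hard sack/no-sack label, the classifiers imported above are of interest mainly for their `predict_proba` method. The sketch below illustrates this on synthetic data with a rare positive class; the features, coefficients, and hyperparameters are assumptions made for the example and have nothing to do with the real play-by-play columns.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data: three made-up features and a rare positive class,
# loosely standing in for "sack" on a pass play.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))
logits = -3.0 + 1.0 * X[:, 0]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))

# Stratify so the rare class appears in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# A classifier (not a regressor) exposes class probabilities directly.
model = RandomForestClassifier(
    n_estimators=100, min_samples_leaf=20, random_state=0
)
model.fit(X_train, y_train)

# Columns of predict_proba follow model.classes_, so column 1 is P(y = 1).
sack_prob = model.predict_proba(X_test)[:, 1]
print(model.classes_, sack_prob[:5])
```

One design note: reweighting the classes (for example with `compute_sample_weight("balanced", y_train)`) can help recall on the rare class, but it also shifts the predicted probabilities away from the observed base rate, so it should be used with care when the probabilities themselves are the product.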

Exploratory Data Analysis

Let's begin by exploring and getting familiar with the various datasets I have been provided. For reference, see the following data dictionaries:

  • Dictionary
  • Depth Chart Dictionary
  • Snap Counts Dictionary
  • Dictionary Rosters
In [2]:
players_df = pd.read_csv(
    "../data/players.csv",
    header=0,
    nrows=1000
)
players_df.head()
Out[2]:
status display_name first_name last_name esb_id gsis_id birth_date college_name position_group position jersey_number height weight years_of_experience team_abbr team_seq current_team_id football_name entry_year rookie_year draft_club draft_number draftround college_conference status_description_abbr status_short_description gsis_it_id short_name smart_id headshot uniform_number suffix
0 RET 'Omar Ellison 'Omar Ellison ELL711319 00-0004866 1971-10-08 NaN WR WR 84.0 73.0 200.0 2.0 LAC NaN 4400.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 3200454c-4c71-1319-728e-d49d3d236f8f NaN NaN NaN
1 ACT A'Shawn Robinson A'Shawn Robinson ROB367960 00-0032889 1995-03-21 Alabama DL DE 94.0 76.0 330.0 9.0 CAR 1.0 750.0 A'Shawn 2016.0 2016.0 DET 46.0 2.0 Southeastern Conference A01 Active 43335.0 A.Robinson 3200524f-4236-7960-bf20-bc060ac0f49c https://static.www.nfl.com/image/upload/f_auto... 94 NaN
2 DEV A.J. Arcuri A.J. Arcuri ARC716900 00-0037845 1997-08-13 Michigan State OL T 61.0 79.0 320.0 2.0 LA NaN 2510.0 A.J. 2022.0 2022.0 LA 261.0 7.0 Big Ten Conference P01 Practice Squad 54726.0 A.Arcuri 32004152-4371-6900-5185-8cdd66b2ad11 https://static.www.nfl.com/image/upload/f_auto... 61 NaN
3 RES A.J. Bouye Arlandus Bouye BOU651714 00-0030228 1991-08-16 Central Florida DB CB 24.0 72.0 191.0 8.0 CAR 1.0 750.0 A.J. 2013.0 2013.0 NaN NaN NaN American Athletic Conference R01 R/Injured 40688.0 A.Bouye 3200424f-5565-1714-cb38-07c822111a12 https://static.www.nfl.com/image/private/f_aut... 24 NaN
4 ACT A.J. Brown Arthur Brown BRO413223 00-0035676 1997-06-30 Mississippi WR WR 11.0 72.0 226.0 6.0 PHI 1.0 3700.0 A.J. 2019.0 2019.0 TEN 51.0 2.0 Southeastern Conference A01 Active 47834.0 A.Brown 32004252-4f41-3223-e4c5-1e30dffa87f8 https://static.www.nfl.com/image/private/f_aut... 11 NaN
In [3]:
depth_charts_2023_df = pd.read_csv(
    "../data/depth_charts_2023.csv",
    header=0,
    # nrows=1000
)
# depth_charts_2023_df[(depth_charts_2023_df["club_code"] == "SEA") & (depth_charts_2023_df["formation"] == "Defense")].sort_values(by=["week", "position", "depth_team"]).head(25)

# NOTE: despite the `_2023` suffix in the variable name, this loads the 2022 season file.
play_by_play_2023_df = pd.read_csv(
    "../data/play_by_play_2022.csv",
    header=0,
    # nrows=10,
    low_memory=False
)
play_by_play_2023_df.sort_values(by=['week', 'play_id']).dropna(subset=["play_type"]).head()
Out[3]:
play_id game_id old_game_id home_team away_team season_type week posteam posteam_type defteam side_of_field yardline_100 game_date quarter_seconds_remaining half_seconds_remaining game_seconds_remaining game_half quarter_end drive sp qtr down goal_to_go time yrdln ydstogo ydsnet desc play_type yards_gained shotgun no_huddle qb_dropback qb_kneel qb_spike qb_scramble pass_length pass_location air_yards yards_after_catch run_location run_gap field_goal_result kick_distance extra_point_result two_point_conv_result home_timeouts_remaining away_timeouts_remaining timeout timeout_team td_team td_player_name td_player_id posteam_timeouts_remaining defteam_timeouts_remaining total_home_score total_away_score posteam_score defteam_score score_differential posteam_score_post defteam_score_post score_differential_post no_score_prob opp_fg_prob opp_safety_prob opp_td_prob fg_prob safety_prob td_prob extra_point_prob two_point_conversion_prob ep epa total_home_epa total_away_epa total_home_rush_epa total_away_rush_epa total_home_pass_epa total_away_pass_epa air_epa yac_epa comp_air_epa comp_yac_epa total_home_comp_air_epa total_away_comp_air_epa total_home_comp_yac_epa total_away_comp_yac_epa total_home_raw_air_epa total_away_raw_air_epa total_home_raw_yac_epa total_away_raw_yac_epa wp def_wp home_wp away_wp wpa vegas_wpa vegas_home_wpa home_wp_post away_wp_post vegas_wp vegas_home_wp total_home_rush_wpa total_away_rush_wpa total_home_pass_wpa total_away_pass_wpa air_wpa yac_wpa comp_air_wpa comp_yac_wpa total_home_comp_air_wpa total_away_comp_air_wpa total_home_comp_yac_wpa total_away_comp_yac_wpa total_home_raw_air_wpa total_away_raw_air_wpa total_home_raw_yac_wpa total_away_raw_yac_wpa punt_blocked first_down_rush first_down_pass first_down_penalty third_down_converted third_down_failed fourth_down_converted fourth_down_failed incomplete_pass touchback interception punt_inside_twenty punt_in_endzone punt_out_of_bounds punt_downed punt_fair_catch kickoff_inside_twenty 
kickoff_in_endzone kickoff_out_of_bounds kickoff_downed kickoff_fair_catch fumble_forced fumble_not_forced fumble_out_of_bounds solo_tackle safety penalty tackled_for_loss fumble_lost own_kickoff_recovery own_kickoff_recovery_td qb_hit rush_attempt pass_attempt sack touchdown pass_touchdown rush_touchdown return_touchdown extra_point_attempt two_point_attempt field_goal_attempt kickoff_attempt punt_attempt fumble complete_pass assist_tackle lateral_reception lateral_rush lateral_return lateral_recovery passer_player_id passer_player_name passing_yards receiver_player_id receiver_player_name receiving_yards rusher_player_id rusher_player_name rushing_yards lateral_receiver_player_id lateral_receiver_player_name lateral_receiving_yards lateral_rusher_player_id lateral_rusher_player_name lateral_rushing_yards lateral_sack_player_id lateral_sack_player_name interception_player_id interception_player_name lateral_interception_player_id lateral_interception_player_name punt_returner_player_id punt_returner_player_name lateral_punt_returner_player_id lateral_punt_returner_player_name kickoff_returner_player_name kickoff_returner_player_id lateral_kickoff_returner_player_id lateral_kickoff_returner_player_name punter_player_id punter_player_name kicker_player_name kicker_player_id own_kickoff_recovery_player_id own_kickoff_recovery_player_name blocked_player_id blocked_player_name tackle_for_loss_1_player_id tackle_for_loss_1_player_name tackle_for_loss_2_player_id tackle_for_loss_2_player_name qb_hit_1_player_id qb_hit_1_player_name qb_hit_2_player_id qb_hit_2_player_name forced_fumble_player_1_team forced_fumble_player_1_player_id forced_fumble_player_1_player_name forced_fumble_player_2_team forced_fumble_player_2_player_id forced_fumble_player_2_player_name solo_tackle_1_team solo_tackle_2_team solo_tackle_1_player_id solo_tackle_2_player_id solo_tackle_1_player_name solo_tackle_2_player_name assist_tackle_1_player_id assist_tackle_1_player_name assist_tackle_1_team 
assist_tackle_2_player_id assist_tackle_2_player_name assist_tackle_2_team assist_tackle_3_player_id assist_tackle_3_player_name assist_tackle_3_team assist_tackle_4_player_id assist_tackle_4_player_name assist_tackle_4_team tackle_with_assist tackle_with_assist_1_player_id tackle_with_assist_1_player_name tackle_with_assist_1_team tackle_with_assist_2_player_id tackle_with_assist_2_player_name tackle_with_assist_2_team pass_defense_1_player_id pass_defense_1_player_name pass_defense_2_player_id pass_defense_2_player_name fumbled_1_team fumbled_1_player_id fumbled_1_player_name fumbled_2_player_id fumbled_2_player_name fumbled_2_team fumble_recovery_1_team fumble_recovery_1_yards fumble_recovery_1_player_id fumble_recovery_1_player_name fumble_recovery_2_team fumble_recovery_2_yards fumble_recovery_2_player_id fumble_recovery_2_player_name sack_player_id sack_player_name half_sack_1_player_id half_sack_1_player_name half_sack_2_player_id half_sack_2_player_name return_team return_yards penalty_team penalty_player_id penalty_player_name penalty_yards replay_or_challenge replay_or_challenge_result penalty_type defensive_two_point_attempt defensive_two_point_conv defensive_extra_point_attempt defensive_extra_point_conv safety_player_name safety_player_id season cp cpoe series series_success series_result order_sequence start_time time_of_day stadium weather nfl_api_id play_clock play_deleted play_type_nfl special_teams_play st_play_type end_clock_time end_yard_line fixed_drive fixed_drive_result drive_real_start_time drive_play_count drive_time_of_possession drive_first_downs drive_inside20 drive_ended_with_score drive_quarter_start drive_quarter_end drive_yards_penalized drive_start_transition drive_end_transition drive_game_clock_start drive_game_clock_end drive_start_yard_line drive_end_yard_line drive_play_id_started drive_play_id_ended away_score home_score location result total spread_line total_line div_game roof surface temp wind home_coach away_coach 
stadium_id game_stadium aborted_play success passer passer_jersey_number rusher rusher_jersey_number receiver receiver_jersey_number pass rush first_down special play passer_id rusher_id receiver_id name jersey_number id fantasy_player_name fantasy_player_id fantasy fantasy_id out_of_bounds home_opening_kickoff qb_epa xyac_epa xyac_mean_yardage xyac_median_yardage xyac_success xyac_fd xpass pass_oe
682 40 2022_01_GB_MIN 2022091112 MIN GB REG 1 MIN home GB GB 35.0 2022-09-11 900 1800 3600 Half1 0 1.0 0 1 NaN 0 15:00 GB 35 0 78.0 2-M.Crosby kicks 68 yards from GB 35 to MIN -3... kickoff 0.0 0 0 0.0 0 0 0 NaN NaN NaN NaN NaN NaN NaN 68.0 NaN NaN 3 3 0.0 NaN NaN NaN NaN 3.0 3.0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.003488 0.133672 0.002093 0.210531 0.203776 0.003241 0.443199 0.0 0.0 1.841284 -0.402527 -0.402527 0.402527 0.0 0.0 0.0 0.0 NaN NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.546262 0.453738 0.546262 0.453738 0.000707 -0.000830 -0.000830 0.546969 0.453031 0.569057 0.569057 0.0 0.0 0.0 0.0 NaN NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN K.Nwangwu 00-0036842 NaN NaN NaN NaN M.Crosby 00-0025580 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN GB NaN 00-0036901 NaN E.Stokes NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN MIN 25.0 NaN NaN NaN NaN 0 NaN NaN 0.0 0.0 0.0 0.0 NaN NaN 2022 NaN NaN 1 1 First down 40 9/11/22, 16:25:39 2022-09-11T20:25:39.297Z U.S. Bank Stadium N/A (Indoors) Temp: 73° F, Humidity: 31%, Wind... 7ae7fefa-d24c-11ec-b23d-d15a91047884 0 0 KICK_OFF 1 NaN 2022-09-11T20:25:49.117Z NaN 1 Touchdown 2022-09-11T20:25:39.297Z 10.0 5:43 4.0 1.0 1.0 1.0 1.0 0.0 KICKOFF TOUCHDOWN 15:00 09:17 MIN 22 GB 5 40.0 300.0 7 23 Home 16 30 2.5 47.0 1 dome sportturf NaN NaN Kevin O'Connell Matt LaFleur MIN01 U.S. Bank Stadium 0 0.0 NaN NaN NaN NaN NaN NaN 0 0 0.0 1 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 1 -0.402527 NaN NaN NaN NaN NaN NaN NaN
1250 40 2022_01_KC_ARI 2022091110 ARI KC REG 1 KC away ARI ARI 35.0 2022-09-11 900 1800 3600 Half1 0 1.0 0 1 NaN 0 15:00 ARI 35 0 75.0 5-M.Prater kicks 65 yards from ARI 35 to end z... kickoff 0.0 0 0 0.0 0 0 0 NaN NaN NaN NaN NaN NaN NaN 65.0 NaN NaN 3 3 0.0 NaN NaN NaN NaN 3.0 3.0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.002202 0.137016 0.002210 0.269720 0.217728 0.003372 0.367752 0.0 0.0 0.930688 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 NaN NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.433208 0.566792 0.566792 0.433208 0.000000 0.000000 0.000000 0.566792 0.433208 0.725613 0.274387 0.0 0.0 0.0 0.0 NaN NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN M.Prater 00-0023853 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN KC 0.0 NaN NaN NaN NaN 0 NaN NaN 0.0 0.0 0.0 0.0 NaN NaN 2022 NaN NaN 1 1 First down 40 9/11/22, 16:26:28 2022-09-11T20:26:28.687Z State Farm Stadium N/A Temp: Humidity: Wind: mph 7ae7f21b-d24c-11ec-b23d-d15a91047884 0 0 KICK_OFF 1 NaN 2022-09-11T20:26:31.880Z NaN 1 Touchdown 2022-09-11T20:26:28.687Z 11.0 5:23 5.0 1.0 1.0 1.0 1.0 0.0 KICKOFF TOUCHDOWN 15:00 09:37 KC 25 ARI 9 40.0 316.0 44 21 Home -23 65 -6.0 54.0 0 closed grass NaN NaN Kliff Kingsbury Andy Reid PHO00 State Farm Stadium 0 0.0 NaN NaN NaN NaN NaN NaN 0 0 0.0 1 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 0 0.000000 NaN NaN NaN NaN NaN NaN NaN
1423 40 2022_01_LV_LAC 2022091111 LAC LV REG 1 LAC home LV LV 35.0 2022-09-11 900 1800 3600 Half1 0 1.0 0 1 NaN 0 15:00 LV 35 0 50.0 2-D.Carlson kicks 65 yards from LV 35 to end z... kickoff 0.0 0 0 0.0 0 0 0 NaN NaN NaN NaN NaN NaN NaN 65.0 NaN NaN 3 3 0.0 NaN NaN NaN NaN 3.0 3.0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.003488 0.133672 0.002093 0.210531 0.203776 0.003241 0.443199 0.0 0.0 1.841284 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 NaN NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.546262 0.453738 0.546262 0.453738 0.000000 0.000000 0.000000 0.546262 0.453738 0.614067 0.614067 0.0 0.0 0.0 0.0 NaN NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN D.Carlson 00-0034161 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN LAC 0.0 NaN NaN NaN NaN 0 NaN NaN 0.0 0.0 0.0 0.0 NaN NaN 2022 NaN NaN 1 1 First down 40 9/11/22, 16:26:32 2022-09-11T20:26:33.297Z SoFi Stadium Cloudy Temp: 86° F, Humidity: 56%, Wind: SSW 3... 7ae7f864-d24c-11ec-b23d-d15a91047884 0 0 KICK_OFF 1 NaN 2022-09-11T20:26:37.777Z NaN 1 Field goal 2022-09-11T20:26:33.297Z 12.0 5:57 3.0 0.0 1.0 1.0 1.0 0.0 KICKOFF FIELD_GOAL 15:00 09:03 LAC 25 LV 25 40.0 302.0 19 24 Home 5 43 3.5 52.5 1 dome matrixturf NaN NaN Brandon Staley Josh McDaniels LAX01 SoFi Stadium 0 0.0 NaN NaN NaN NaN NaN NaN 0 0 0.0 1 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 1 0.000000 NaN NaN NaN NaN NaN NaN NaN
1927 40 2022_01_NYG_TEN 2022091108 TEN NYG REG 1 NYG away TEN TEN 35.0 2022-09-11 900 1800 3600 Half1 0 1.0 0 1 NaN 0 15:00 TEN 35 0 9.0 14-R.Bullock kicks 67 yards from TEN 35 to NYG... kickoff 0.0 0 0 0.0 0 0 0 NaN NaN NaN NaN NaN NaN NaN 67.0 NaN NaN 3 3 0.0 NaN NaN NaN NaN 3.0 3.0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.004568 0.143585 0.002325 0.275986 0.215226 0.003265 0.355046 0.0 0.0 0.770222 -0.374267 0.374267 -0.374267 0.0 0.0 0.0 0.0 NaN NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.433208 0.566792 0.566792 0.433208 0.000701 -0.003666 0.003666 0.566091 0.433909 0.294656 0.705344 0.0 0.0 0.0 0.0 NaN NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN G.Brightwell 00-0036569 NaN NaN NaN NaN R.Bullock 00-0029421 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 00-0033659 J.Jones TEN 00-0034164 T.Cannon TEN NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NYG 24.0 NaN NaN NaN NaN 0 NaN NaN 0.0 0.0 0.0 0.0 NaN NaN 2022 NaN NaN 1 0 Punt 40 9/11/22, 16:26:15 2022-09-11T20:26:15.883Z Nissan Stadium Cloudy Temp: 75° F, Humidity: 87%, Wind: WSW 9... 7ae7e4c2-d24c-11ec-b23d-d15a91047884 0 0 KICK_OFF 1 NaN 2022-09-11T20:26:24.423Z NaN 1 Punt 2022-09-11T20:26:15.883Z 3.0 2:12 0.0 0.0 0.0 1.0 1.0 0.0 KICKOFF PUNT 15:00 12:48 NYG 22 NYG 31 40.0 125.0 21 20 Home -1 41 5.5 43.5 0 outdoors grass NaN NaN Mike Vrabel Brian Daboll NAS00 Nissan Stadium 0 0.0 NaN NaN NaN NaN NaN NaN 0 0 0.0 1 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 0 -0.374267 NaN NaN NaN NaN NaN NaN NaN
178 41 2022_01_BUF_LA 2022090800 LA BUF REG 1 BUF away LA LA 35.0 2022-09-08 900 1800 3600 Half1 0 1.0 0 1 NaN 0 15:00 LA 35 0 75.0 8-M.Gay kicks 65 yards from LA 35 to end zone,... kickoff 0.0 0 0 0.0 0 0 0 NaN NaN NaN NaN NaN NaN NaN 65.0 NaN NaN 3 3 0.0 NaN NaN NaN NaN 3.0 3.0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.003473 0.128879 0.002270 0.272088 0.208195 0.003240 0.381854 0.0 0.0 1.008251 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 NaN NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.433208 0.566792 0.566792 0.433208 0.000000 0.000000 0.000000 0.566792 0.433208 0.549646 0.450354 0.0 0.0 0.0 0.0 NaN NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN M.Gay 00-0035269 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN BUF 0.0 NaN NaN NaN NaN 0 NaN NaN 0.0 0.0 0.0 0.0 NaN NaN 2022 NaN NaN 1 1 First down 41 9/8/22, 20:23:17 2022-09-09T00:23:17.820Z SoFi Stadium Cloudy Temp: 88° F, Humidity: 48%, Wind: W 8 mph 7ae7944e-d24c-11ec-b23d-d15a91047884 0 0 KICK_OFF 1 NaN 2022-09-09T00:23:21.857Z NaN 1 Touchdown 2022-09-09T00:23:17.820Z 9.0 5:04 4.0 0.0 1.0 1.0 1.0 0.0 KICKOFF TOUCHDOWN 15:00 09:56 BUF 25 LA 26 41.0 261.0 31 10 Home -21 41 -1.0 51.5 0 dome matrixturf NaN NaN Sean McVay Sean McDermott LAX01 SoFi Stadium 0 0.0 NaN NaN NaN NaN NaN NaN 0 0 0.0 1 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 0 0.000000 NaN NaN NaN NaN NaN NaN NaN
In [4]:
# Count the rows with a missing play_type in each game.
number_of_nan_play_types_per_game = (
    play_by_play_2023_df[play_by_play_2023_df["play_type"].isna()]
    .groupby("game_id", as_index=False)
    .agg({"play_id": "count"})
)
number_of_nan_play_types_per_game.head()
Out[4]:
   game_id          play_id
0  2022_01_BAL_NYJ  5
1  2022_01_BUF_LA   5
2  2022_01_CLE_CAR  5
3  2022_01_DEN_SEA  5
4  2022_01_GB_MIN   5
In [5]:
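# Note: .where() followed by dropna() upcasts integer columns to float (e.g. play_id shows as 1.0);
# boolean indexing on game_id would select the same rows while preserving dtypes.
play_by_play_2023_df.where(play_by_play_2023_df.game_id == "2022_10_WAS_PHI").dropna(subset=["game_id"]).head()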
Out[5]:
play_id game_id old_game_id home_team away_team season_type week posteam posteam_type defteam side_of_field yardline_100 game_date quarter_seconds_remaining half_seconds_remaining game_seconds_remaining game_half quarter_end drive sp qtr down goal_to_go time yrdln ydstogo ydsnet desc play_type yards_gained shotgun no_huddle qb_dropback qb_kneel qb_spike qb_scramble pass_length pass_location air_yards yards_after_catch run_location run_gap field_goal_result kick_distance extra_point_result two_point_conv_result home_timeouts_remaining away_timeouts_remaining timeout timeout_team td_team td_player_name td_player_id posteam_timeouts_remaining defteam_timeouts_remaining total_home_score total_away_score posteam_score defteam_score score_differential posteam_score_post defteam_score_post score_differential_post no_score_prob opp_fg_prob opp_safety_prob opp_td_prob fg_prob safety_prob td_prob extra_point_prob two_point_conversion_prob ep epa total_home_epa total_away_epa total_home_rush_epa total_away_rush_epa total_home_pass_epa total_away_pass_epa air_epa yac_epa comp_air_epa comp_yac_epa total_home_comp_air_epa total_away_comp_air_epa total_home_comp_yac_epa total_away_comp_yac_epa total_home_raw_air_epa total_away_raw_air_epa total_home_raw_yac_epa total_away_raw_yac_epa wp def_wp home_wp away_wp wpa vegas_wpa vegas_home_wpa home_wp_post away_wp_post vegas_wp vegas_home_wp total_home_rush_wpa total_away_rush_wpa total_home_pass_wpa total_away_pass_wpa air_wpa yac_wpa comp_air_wpa comp_yac_wpa total_home_comp_air_wpa total_away_comp_air_wpa total_home_comp_yac_wpa total_away_comp_yac_wpa total_home_raw_air_wpa total_away_raw_air_wpa total_home_raw_yac_wpa total_away_raw_yac_wpa punt_blocked first_down_rush first_down_pass first_down_penalty third_down_converted third_down_failed fourth_down_converted fourth_down_failed incomplete_pass touchback interception punt_inside_twenty punt_in_endzone punt_out_of_bounds punt_downed punt_fair_catch kickoff_inside_twenty 
kickoff_in_endzone kickoff_out_of_bounds kickoff_downed kickoff_fair_catch fumble_forced fumble_not_forced fumble_out_of_bounds solo_tackle safety penalty tackled_for_loss fumble_lost own_kickoff_recovery own_kickoff_recovery_td qb_hit rush_attempt pass_attempt sack touchdown pass_touchdown rush_touchdown return_touchdown extra_point_attempt two_point_attempt field_goal_attempt kickoff_attempt punt_attempt fumble complete_pass assist_tackle lateral_reception lateral_rush lateral_return lateral_recovery passer_player_id passer_player_name passing_yards receiver_player_id receiver_player_name receiving_yards rusher_player_id rusher_player_name rushing_yards lateral_receiver_player_id lateral_receiver_player_name lateral_receiving_yards lateral_rusher_player_id lateral_rusher_player_name lateral_rushing_yards lateral_sack_player_id lateral_sack_player_name interception_player_id interception_player_name lateral_interception_player_id lateral_interception_player_name punt_returner_player_id punt_returner_player_name lateral_punt_returner_player_id lateral_punt_returner_player_name kickoff_returner_player_name kickoff_returner_player_id lateral_kickoff_returner_player_id lateral_kickoff_returner_player_name punter_player_id punter_player_name kicker_player_name kicker_player_id own_kickoff_recovery_player_id own_kickoff_recovery_player_name blocked_player_id blocked_player_name tackle_for_loss_1_player_id tackle_for_loss_1_player_name tackle_for_loss_2_player_id tackle_for_loss_2_player_name qb_hit_1_player_id qb_hit_1_player_name qb_hit_2_player_id qb_hit_2_player_name forced_fumble_player_1_team forced_fumble_player_1_player_id forced_fumble_player_1_player_name forced_fumble_player_2_team forced_fumble_player_2_player_id forced_fumble_player_2_player_name solo_tackle_1_team solo_tackle_2_team solo_tackle_1_player_id solo_tackle_2_player_id solo_tackle_1_player_name solo_tackle_2_player_name assist_tackle_1_player_id assist_tackle_1_player_name assist_tackle_1_team 
assist_tackle_2_player_id assist_tackle_2_player_name assist_tackle_2_team assist_tackle_3_player_id assist_tackle_3_player_name assist_tackle_3_team assist_tackle_4_player_id assist_tackle_4_player_name assist_tackle_4_team tackle_with_assist tackle_with_assist_1_player_id tackle_with_assist_1_player_name tackle_with_assist_1_team tackle_with_assist_2_player_id tackle_with_assist_2_player_name tackle_with_assist_2_team pass_defense_1_player_id pass_defense_1_player_name pass_defense_2_player_id pass_defense_2_player_name fumbled_1_team fumbled_1_player_id fumbled_1_player_name fumbled_2_player_id fumbled_2_player_name fumbled_2_team fumble_recovery_1_team fumble_recovery_1_yards fumble_recovery_1_player_id fumble_recovery_1_player_name fumble_recovery_2_team fumble_recovery_2_yards fumble_recovery_2_player_id fumble_recovery_2_player_name sack_player_id sack_player_name half_sack_1_player_id half_sack_1_player_name half_sack_2_player_id half_sack_2_player_name return_team return_yards penalty_team penalty_player_id penalty_player_name penalty_yards replay_or_challenge replay_or_challenge_result penalty_type defensive_two_point_attempt defensive_two_point_conv defensive_extra_point_attempt defensive_extra_point_conv safety_player_name safety_player_id season cp cpoe series series_success series_result order_sequence start_time time_of_day stadium weather nfl_api_id play_clock play_deleted play_type_nfl special_teams_play st_play_type end_clock_time end_yard_line fixed_drive fixed_drive_result drive_real_start_time drive_play_count drive_time_of_possession drive_first_downs drive_inside20 drive_ended_with_score drive_quarter_start drive_quarter_end drive_yards_penalized drive_start_transition drive_end_transition drive_game_clock_start drive_game_clock_end drive_start_yard_line drive_end_yard_line drive_play_id_started drive_play_id_ended away_score home_score location result total spread_line total_line div_game roof surface temp wind home_coach away_coach 
stadium_id game_stadium aborted_play success passer passer_jersey_number rusher rusher_jersey_number receiver receiver_jersey_number pass rush first_down special play passer_id rusher_id receiver_id name jersey_number id fantasy_player_name fantasy_player_id fantasy fantasy_id out_of_bounds home_opening_kickoff qb_epa xyac_epa xyac_mean_yardage xyac_median_yardage xyac_success xyac_fd xpass pass_oe
25978 1.0 2022_10_WAS_PHI 2.022111e+09 PHI WAS REG 10.0 NaN NaN NaN NaN NaN 2022-11-14 900.0 1800.0 3600.0 Half1 0.0 NaN 0.0 1.0 NaN 0.0 15:00 PHI 35 0.0 NaN GAME NaN NaN 0.0 0.0 NaN 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 3.0 3.0 NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 NaN NaN NaN NaN NaN NaN 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.770222 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 NaN NaN NaN NaN 0.0 0.0 0.0 0.0 0.000000 0.000000 0.00000 0.00000 0.433208 0.566792 0.566792 0.433208 0.000000 0.000000 0.000000 NaN NaN 0.162191 0.837809 0.000000 0.000000 0.00000 0.00000 NaN NaN NaN NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.00000 0.00000 NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN NaN NaN 2022.0 NaN NaN 1.0 1.0 First down 1.0 11/14/22, 20:15:23 NaN Lincoln Financial Field Clear Temp: 40° F, Humidity: 51%, Wind: NNW 3 mph 9574c667-d24c-11ec-b23d-d15a91047884 0.0 0.0 GAME_START 0.0 NaN NaN NaN 1.0 Turnover NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 32.0 21.0 Home -11.0 53.0 11.0 43.0 1.0 outdoors grass 43.0 6.0 Nick Sirianni Ron Rivera PHI00 Lincoln Financial Field 0.0 0.0 NaN NaN NaN NaN NaN NaN 0.0 0.0 NaN 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.000000 NaN NaN NaN NaN NaN NaN NaN
25979 41.0 2022_10_WAS_PHI 2.022111e+09 PHI WAS REG 10.0 WAS away PHI PHI 35.0 2022-11-14 900.0 1800.0 3600.0 Half1 0.0 1.0 0.0 1.0 NaN 0.0 15:00 PHI 35 0.0 10.0 4-J.Elliott kicks 63 yards from PHI 35 to WAS ... kickoff 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN 63.0 NaN NaN 3.0 3.0 0.0 NaN NaN NaN NaN 3.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.004568 0.143585 0.002325 0.275986 0.215226 0.003265 0.355046 0.0 0.0 0.770222 -1.217267 1.217267 -1.217267 0.000000 0.000000 0.000000 0.000000 NaN NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.00000 0.00000 0.433208 0.566792 0.566792 0.433208 -0.021700 -0.022272 0.022272 0.588493 0.411507 0.162191 0.837809 0.000000 0.000000 0.00000 0.00000 NaN NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00000 0.00000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN A.Gibson 00-0036328 NaN NaN NaN NaN J.Elliott 00-0033787 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 00-0037615 N.Dean PHI NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 00-0034623 A.Chachere PHI NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN WAS 14.0 WAS 00-0037168 A.Rogers 8.0 0.0 NaN Offensive Holding 0.0 0.0 0.0 0.0 NaN NaN 2022.0 NaN NaN 1.0 1.0 First down 41.0 11/14/22, 20:15:23 2022-11-15T01:15:23Z Lincoln Financial Field Clear Temp: 40° F, Humidity: 51%, Wind: NNW 3 mph 9574c667-d24c-11ec-b23d-d15a91047884 0.0 0.0 KICK_OFF 1.0 NaN NaN NaN 1.0 Turnover 2022-11-15T01:15:23Z 4.0 1:48 1.0 0.0 0.0 1.0 1.0 15.0 KICKOFF FUMBLE 15:00 13:12 WAS 8 WAS 28 41.0 174.0 32.0 21.0 Home -11.0 53.0 11.0 43.0 1.0 outdoors grass 43.0 6.0 Nick Sirianni Ron Rivera PHI00 Lincoln Financial Field 0.0 0.0 NaN NaN NaN NaN NaN NaN 0.0 0.0 
0.0 1.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 -1.217267 NaN NaN NaN NaN NaN NaN NaN
25980 74.0 2022_10_WAS_PHI 2.022111e+09 PHI WAS REG 10.0 WAS away PHI WAS 92.0 2022-11-14 892.0 1792.0 3592.0 Half1 0.0 1.0 0.0 1.0 1.0 0.0 14:52 WAS 8 10.0 10.0 (14:52) (Shotgun) 8-B.Robinson up the middle t... run 3.0 1.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN middle NaN NaN NaN NaN NaN 3.0 3.0 0.0 NaN NaN NaN NaN 3.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.004857 0.194346 0.014364 0.326385 0.190744 0.001588 0.267716 0.0 0.0 -0.447044 -0.289755 1.507021 -1.507021 0.289755 -0.289755 0.000000 0.000000 NaN NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.00000 0.00000 0.411507 0.588493 0.588493 0.411507 -0.006098 -0.003941 0.003941 0.594590 0.405410 0.139919 0.860081 0.006098 -0.006098 0.00000 0.00000 NaN NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00000 0.00000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN 00-0037746 B.Robinson 3.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 00-0034381 J.Sweat PHI 00-0036920 M.Tuipulotu PHI NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN 0.0 NaN NaN 0.0 0.0 0.0 0.0 NaN NaN 2022.0 NaN NaN 1.0 1.0 First down 74.0 11/14/22, 20:15:23 2022-11-15T01:16:37Z Lincoln Financial Field Clear Temp: 40° F, Humidity: 51%, Wind: NNW 3 mph 9574c667-d24c-11ec-b23d-d15a91047884 0.0 0.0 RUSH 0.0 NaN NaN NaN 1.0 Turnover 2022-11-15T01:15:23Z 4.0 1:48 1.0 0.0 0.0 1.0 1.0 15.0 KICKOFF FUMBLE 15:00 13:12 WAS 8 WAS 28 41.0 174.0 32.0 21.0 Home -11.0 53.0 11.0 43.0 1.0 outdoors grass 43.0 6.0 Nick Sirianni Ron Rivera PHI00 Lincoln Financial Field 0.0 0.0 NaN NaN B.Robinson 8.0 NaN NaN 0.0 1.0 0.0 0.0 1.0 NaN 00-0037746 NaN 
B.Robinson 8.0 00-0037746 B.Robinson 00-0037746 B.Robinson 00-0037746 0.0 0.0 -0.289755 NaN NaN NaN NaN NaN 0.361042 -36.104223
25981 95.0 2022_10_WAS_PHI 2.022111e+09 PHI WAS REG 10.0 WAS away PHI WAS 89.0 2022-11-14 859.0 1759.0 3559.0 Half1 0.0 1.0 0.0 1.0 2.0 0.0 14:19 WAS 11 7.0 10.0 (14:19) (Shotgun) 8-B.Robinson right guard to ... run 2.0 1.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN right guard NaN NaN NaN NaN 3.0 3.0 0.0 NaN NaN NaN NaN 3.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.004976 0.203961 0.005102 0.350961 0.172904 0.002273 0.259823 0.0 0.0 -0.736799 -0.448021 1.955042 -1.955042 0.737775 -0.737775 0.000000 0.000000 NaN NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.00000 0.00000 0.405410 0.594590 0.594590 0.405410 -0.021418 -0.001524 0.001524 0.616008 0.383992 0.135978 0.864022 0.027516 -0.027516 0.00000 0.00000 NaN NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00000 0.00000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN 00-0037746 B.Robinson 2.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 00-0036920 M.Tuipulotu PHI 00-0029653 F.Cox PHI NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN 0.0 NaN NaN 0.0 0.0 0.0 0.0 NaN NaN 2022.0 NaN NaN 1.0 1.0 First down 95.0 11/14/22, 20:15:23 2022-11-15T01:17:10Z Lincoln Financial Field Clear Temp: 40° F, Humidity: 51%, Wind: NNW 3 mph 9574c667-d24c-11ec-b23d-d15a91047884 0.0 0.0 RUSH 0.0 NaN NaN NaN 1.0 Turnover 2022-11-15T01:15:23Z 4.0 1:48 1.0 0.0 0.0 1.0 1.0 15.0 KICKOFF FUMBLE 15:00 13:12 WAS 8 WAS 28 41.0 174.0 32.0 21.0 Home -11.0 53.0 11.0 43.0 1.0 outdoors grass 43.0 6.0 Nick Sirianni Ron Rivera PHI00 Lincoln Financial Field 0.0 0.0 NaN NaN B.Robinson 8.0 NaN NaN 0.0 1.0 0.0 0.0 1.0 NaN 00-0037746 NaN 
B.Robinson 8.0 00-0037746 B.Robinson 00-0037746 B.Robinson 00-0037746 0.0 0.0 -0.448021 NaN NaN NaN NaN NaN 0.556989 -55.698919
25982 116.0 2022_10_WAS_PHI 2.022111e+09 PHI WAS REG 10.0 WAS away PHI WAS 87.0 2022-11-14 817.0 1717.0 3517.0 Half1 0.0 1.0 0.0 1.0 3.0 0.0 13:37 WAS 13 5.0 10.0 (13:37) (Shotgun) 4-T.Heinicke pass incomplete... pass 0.0 1.0 0.0 1.0 0.0 0.0 0.0 deep left 20.0 NaN NaN NaN NaN NaN NaN NaN 3.0 3.0 0.0 NaN NaN NaN NaN 3.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.005277 0.234588 0.003712 0.366850 0.154731 0.002753 0.232089 0.0 0.0 -1.184820 -1.298478 3.253520 -3.253520 0.737775 -0.737775 1.298478 -1.298478 2.561982 -3.86046 0.0 0.0 0.0 0.0 0.0 0.0 -2.561982 2.561982 3.86046 -3.86046 0.383992 0.616008 0.616008 0.383992 -0.032710 -0.015145 0.015145 0.648718 0.351282 0.134454 0.865546 0.027516 -0.027516 0.03271 -0.03271 0.0 -0.03271 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.03271 -0.03271 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 00-0031800 T.Heinicke NaN 00-0033282 C.Samuel NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN 00-0036303 J.Scott NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN 0.0 NaN NaN 0.0 0.0 0.0 0.0 NaN NaN 2022.0 0.407574 -40.757403 1.0 1.0 First down 116.0 11/14/22, 20:15:23 2022-11-15T01:17:52Z Lincoln Financial Field Clear Temp: 40° F, Humidity: 51%, Wind: NNW 3 mph 9574c667-d24c-11ec-b23d-d15a91047884 0.0 0.0 PASS 0.0 NaN NaN NaN 1.0 Turnover 2022-11-15T01:15:23Z 4.0 1:48 1.0 0.0 0.0 1.0 1.0 15.0 KICKOFF FUMBLE 15:00 13:12 WAS 8 WAS 28 41.0 174.0 32.0 21.0 Home -11.0 53.0 11.0 43.0 1.0 outdoors grass 43.0 6.0 Nick Sirianni Ron Rivera PHI00 Lincoln Financial Field 0.0 0.0 T.Heinicke 4.0 NaN NaN C.Samuel 10.0 
1.0 0.0 0.0 0.0 1.0 00-0031800 NaN 00-0033282 T.Heinicke 4.0 00-0031800 C.Samuel 00-0033282 C.Samuel 00-0033282 0.0 0.0 -1.298478 0.388345 5.862539 3.0 1.0 1.0 0.950282 4.971772
In [6]:
eagles_loss_to_wash_df = play_by_play_2023_df[play_by_play_2023_df.game_id == "2022_10_WAS_PHI"]
In [7]:
for desc in eagles_loss_to_wash_df[eagles_loss_to_wash_df.drive == 1.0].desc:
    print(desc)
    print("===")
4-J.Elliott kicks 63 yards from PHI 35 to WAS 2. 24-A.Gibson to WAS 43 for 41 yards (21-A.Chachere, 17-N.Dean). PENALTY on WAS-88-A.Rogers, Offensive Holding, 8 yards, enforced at WAS 16.
===
(14:52) (Shotgun) 8-B.Robinson up the middle to WAS 11 for 3 yards (94-J.Sweat; 95-M.Tuipulotu).
===
(14:19) (Shotgun) 8-B.Robinson right guard to WAS 13 for 2 yards (95-M.Tuipulotu; 91-F.Cox).
===
(13:37) (Shotgun) 4-T.Heinicke pass incomplete deep left to 10-C.Samuel (33-J.Scott).
===
(13:32) 5-T.Way punts 47 yards to PHI 40, Center-54-C.Cheeseman. 18-B.Covey to PHI 48 for 8 yards (58-S.Toney; 39-J.Reaves). PENALTY on PHI-32-R.Blankenship, Roughing the Kicker, 15 yards, enforced at WAS 13 - No Play.
===
(13:20) (Shotgun) 4-T.Heinicke sacked at WAS 18 for -10 yards (94-J.Sweat). FUMBLES (94-J.Sweat) [94-J.Sweat], RECOVERED by PHI-95-M.Tuipulotu at WAS 18.
===

The goal is to build a training dataset whose input features describe the situation before the next play, and to use it to predict the probability that the play results in a sack.

  • Features to try to collect include
    • down, distance, and other circumstantial information about the play
    • current score of the game
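As a rough sketch of what one training row could look like, the snippet below pulls the pre-snap situational fields and the current score out of a single play record. The column names (`down`, `ydstogo`, `yardline_100`, `qtr`, `quarter_seconds_remaining`, `posteam_score`, `defteam_score`, `sack`) come from the play-by-play header shown above. The helper name `situational_features` and the example values are hypothetical and only meant to illustrate the idea.

```python
from typing import Any, Dict

# Pre-snap columns taken from the play-by-play header shown above.
SITUATION_COLUMNS = ["down", "ydstogo", "yardline_100", "qtr", "quarter_seconds_remaining"]


def situational_features(play: Dict[str, Any]) -> Dict[str, float]:
    """Hypothetical helper: build one feature row from a single play record."""
    features = {col: float(play[col]) for col in SITUATION_COLUMNS}
    # Score from the offense's point of view, which is known before the snap.
    features["score_differential"] = float(play["posteam_score"]) - float(play["defteam_score"])
    return features


# Illustrative values only: a 3rd-and-5 early in a scoreless game.
example_play = {
    "down": 3.0, "ydstogo": 5.0, "yardline_100": 87.0, "qtr": 1.0,
    "quarter_seconds_remaining": 817.0, "posteam_score": 0.0, "defteam_score": 0.0,
    "sack": 0.0,
}
row = situational_features(example_play)
label = int(example_play["sack"])  # target: 1 if the play ended in a sack
print(row, label)
```

Only information available before the snap goes into the feature row. Outcome columns such as `yards_gained` or `epa` would leak the result of the play into the prediction.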
In [8]:
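# NOTE: despite its name, this frame holds every pass play (sack or not), because the
# sack filter is commented out. The chart further down relies on that to compute mean(sack).
sacks_in_2022 = play_by_play_2023_df[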
    (play_by_play_2023_df.play_type == "pass")
    # & (play_by_play_2023_df.sack == 1.0)
]
sacks_in_2022.head()
Out[8]:
play_id game_id old_game_id home_team away_team season_type week posteam posteam_type defteam side_of_field yardline_100 game_date quarter_seconds_remaining half_seconds_remaining game_seconds_remaining game_half quarter_end drive sp qtr down goal_to_go time yrdln ydstogo ydsnet desc play_type yards_gained shotgun no_huddle qb_dropback qb_kneel qb_spike qb_scramble pass_length pass_location air_yards yards_after_catch run_location run_gap field_goal_result kick_distance extra_point_result two_point_conv_result home_timeouts_remaining away_timeouts_remaining timeout timeout_team td_team td_player_name td_player_id posteam_timeouts_remaining defteam_timeouts_remaining total_home_score total_away_score posteam_score defteam_score score_differential posteam_score_post defteam_score_post score_differential_post no_score_prob opp_fg_prob opp_safety_prob opp_td_prob fg_prob safety_prob td_prob extra_point_prob two_point_conversion_prob ep epa total_home_epa total_away_epa total_home_rush_epa total_away_rush_epa total_home_pass_epa total_away_pass_epa air_epa yac_epa comp_air_epa comp_yac_epa total_home_comp_air_epa total_away_comp_air_epa total_home_comp_yac_epa total_away_comp_yac_epa total_home_raw_air_epa total_away_raw_air_epa total_home_raw_yac_epa total_away_raw_yac_epa wp def_wp home_wp away_wp wpa vegas_wpa vegas_home_wpa home_wp_post away_wp_post vegas_wp vegas_home_wp total_home_rush_wpa total_away_rush_wpa total_home_pass_wpa total_away_pass_wpa air_wpa yac_wpa comp_air_wpa comp_yac_wpa total_home_comp_air_wpa total_away_comp_air_wpa total_home_comp_yac_wpa total_away_comp_yac_wpa total_home_raw_air_wpa total_away_raw_air_wpa total_home_raw_yac_wpa total_away_raw_yac_wpa punt_blocked first_down_rush first_down_pass first_down_penalty third_down_converted third_down_failed fourth_down_converted fourth_down_failed incomplete_pass touchback interception punt_inside_twenty punt_in_endzone punt_out_of_bounds punt_downed punt_fair_catch kickoff_inside_twenty 
kickoff_in_endzone kickoff_out_of_bounds kickoff_downed kickoff_fair_catch fumble_forced fumble_not_forced fumble_out_of_bounds solo_tackle safety penalty tackled_for_loss fumble_lost own_kickoff_recovery own_kickoff_recovery_td qb_hit rush_attempt pass_attempt sack touchdown pass_touchdown rush_touchdown return_touchdown extra_point_attempt two_point_attempt field_goal_attempt kickoff_attempt punt_attempt fumble complete_pass assist_tackle lateral_reception lateral_rush lateral_return lateral_recovery passer_player_id passer_player_name passing_yards receiver_player_id receiver_player_name receiving_yards rusher_player_id rusher_player_name rushing_yards lateral_receiver_player_id lateral_receiver_player_name lateral_receiving_yards lateral_rusher_player_id lateral_rusher_player_name lateral_rushing_yards lateral_sack_player_id lateral_sack_player_name interception_player_id interception_player_name lateral_interception_player_id lateral_interception_player_name punt_returner_player_id punt_returner_player_name lateral_punt_returner_player_id lateral_punt_returner_player_name kickoff_returner_player_name kickoff_returner_player_id lateral_kickoff_returner_player_id lateral_kickoff_returner_player_name punter_player_id punter_player_name kicker_player_name kicker_player_id own_kickoff_recovery_player_id own_kickoff_recovery_player_name blocked_player_id blocked_player_name tackle_for_loss_1_player_id tackle_for_loss_1_player_name tackle_for_loss_2_player_id tackle_for_loss_2_player_name qb_hit_1_player_id qb_hit_1_player_name qb_hit_2_player_id qb_hit_2_player_name forced_fumble_player_1_team forced_fumble_player_1_player_id forced_fumble_player_1_player_name forced_fumble_player_2_team forced_fumble_player_2_player_id forced_fumble_player_2_player_name solo_tackle_1_team solo_tackle_2_team solo_tackle_1_player_id solo_tackle_2_player_id solo_tackle_1_player_name solo_tackle_2_player_name assist_tackle_1_player_id assist_tackle_1_player_name assist_tackle_1_team 
assist_tackle_2_player_id assist_tackle_2_player_name assist_tackle_2_team assist_tackle_3_player_id assist_tackle_3_player_name assist_tackle_3_team assist_tackle_4_player_id assist_tackle_4_player_name assist_tackle_4_team tackle_with_assist tackle_with_assist_1_player_id tackle_with_assist_1_player_name tackle_with_assist_1_team tackle_with_assist_2_player_id tackle_with_assist_2_player_name tackle_with_assist_2_team pass_defense_1_player_id pass_defense_1_player_name pass_defense_2_player_id pass_defense_2_player_name fumbled_1_team fumbled_1_player_id fumbled_1_player_name fumbled_2_player_id fumbled_2_player_name fumbled_2_team fumble_recovery_1_team fumble_recovery_1_yards fumble_recovery_1_player_id fumble_recovery_1_player_name fumble_recovery_2_team fumble_recovery_2_yards fumble_recovery_2_player_id fumble_recovery_2_player_name sack_player_id sack_player_name half_sack_1_player_id half_sack_1_player_name half_sack_2_player_id half_sack_2_player_name return_team return_yards penalty_team penalty_player_id penalty_player_name penalty_yards replay_or_challenge replay_or_challenge_result penalty_type defensive_two_point_attempt defensive_two_point_conv defensive_extra_point_attempt defensive_extra_point_conv safety_player_name safety_player_id season cp cpoe series series_success series_result order_sequence start_time time_of_day stadium weather nfl_api_id play_clock play_deleted play_type_nfl special_teams_play st_play_type end_clock_time end_yard_line fixed_drive fixed_drive_result drive_real_start_time drive_play_count drive_time_of_possession drive_first_downs drive_inside20 drive_ended_with_score drive_quarter_start drive_quarter_end drive_yards_penalized drive_start_transition drive_end_transition drive_game_clock_start drive_game_clock_end drive_start_yard_line drive_end_yard_line drive_play_id_started drive_play_id_ended away_score home_score location result total spread_line total_line div_game roof surface temp wind home_coach away_coach 
stadium_id game_stadium aborted_play success passer passer_jersey_number rusher rusher_jersey_number receiver receiver_jersey_number pass rush first_down special play passer_id rusher_id receiver_id name jersey_number id fantasy_player_name fantasy_player_id fantasy fantasy_id out_of_bounds home_opening_kickoff qb_epa xyac_epa xyac_mean_yardage xyac_median_yardage xyac_success xyac_fd xpass pass_oe
3 89 2022_01_BAL_NYJ 2022091107 NYJ BAL REG 1 NYJ home BAL NYJ 59.0 2022-09-11 869 1769 3569 Half1 0 1.0 0 1 1.0 0 14:29 NYJ 41 10 14.0 (14:29) (No Huddle, Shotgun) 19-J.Flacco pass ... pass 0.0 1 1 1.0 0 0 0 short left 0.0 NaN NaN NaN NaN NaN NaN NaN 3 3 0.0 NaN NaN NaN NaN 3.0 3.0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.004506 0.109583 0.001611 0.167983 0.245590 0.004921 0.465806 0.0 0.0 2.499396 -0.492192 0.533106 -0.533106 1.468819 -1.468819 -0.492192 0.492192 -0.492192 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -0.492192 0.492192 0.000000 0.000000 0.572573 0.427427 0.572573 0.427427 -0.018037 -0.016770 -0.016770 0.554537 0.445463 0.280103 0.280103 0.025604 -0.025604 -0.018037 0.018037 0.0 -0.018037 0.0 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 -0.018037 0.018037 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 00-0026158 J.Flacco NaN 00-0036924 Mi.Carter NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN 0 NaN NaN 0.0 0.0 0.0 0.0 NaN NaN 2022 0.743398 -74.339849 2 0 Punt 89 9/11/22, 13:05:56 2022-09-11T17:07:04.757Z MetLife Stadium Rain and mid 70s Temp: 73° F, Humidity: 79%, W... 
7ae7de72-d24c-11ec-b23d-d15a91047884 0 0 PASS 0 NaN 2022-09-11T17:07:08.393Z NaN 1 Punt 2022-09-11T17:05:56.987Z 4.0 1:18 1.0 0.0 0.0 1.0 1.0 -10.0 KICKOFF PUNT 15:00 13:42 NYJ 22 NYJ 36 43.0 172.0 24 9 Home -15 33 -6.5 44.0 0 outdoors fieldturf NaN NaN Robert Saleh John Harbaugh NYC01 MetLife Stadium 0 0.0 J.Flacco 19.0 NaN NaN Mi.Carter 32.0 1 0 0.0 0 1 00-0026158 NaN 00-0036924 J.Flacco 19.0 00-0026158 Mi.Carter 00-0036924 Mi.Carter 00-0036924 0 1 -0.492192 0.727261 6.988125 6.0 0.606930 0.227598 0.389904 61.009598
5 136 2022_01_BAL_NYJ 2022091107 NYJ BAL REG 1 NYJ home BAL NYJ 54.0 2022-09-11 841 1741 3541 Half1 0 1.0 0 1 3.0 0 14:01 NYJ 46 5 14.0 (14:01) (No Huddle, Shotgun) 19-J.Flacco pass ... pass 0.0 1 1 1.0 0 0 0 short right 0.0 NaN NaN NaN NaN NaN NaN NaN 3 3 0.0 NaN NaN NaN NaN 3.0 3.0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.005385 0.142873 0.001634 0.210421 0.215781 0.005716 0.418190 0.0 0.0 1.681272 -2.402200 -2.195026 2.195026 1.142888 -1.142888 -2.894393 2.894393 -1.672078 -0.730123 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -2.164270 2.164270 -0.730123 0.730123 0.540167 0.459833 0.540167 0.459833 -0.052114 -0.069870 -0.069870 0.488053 0.511947 0.257954 0.257954 0.011235 -0.011235 -0.070151 0.070151 0.0 -0.052114 0.0 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 -0.070151 0.070151 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 00-0026158 J.Flacco NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 00-0026190 C.Campbell NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NYJ 00-0026158 J.Flacco 10.0 0 NaN Intentional Grounding 0.0 0.0 0.0 0.0 NaN NaN 2022 NaN NaN 2 0 Punt 136 9/11/22, 13:05:56 2022-09-11T17:07:51.300Z MetLife Stadium Rain and mid 70s Temp: 73° F, Humidity: 79%, W... 
7ae7de72-d24c-11ec-b23d-d15a91047884 0 0 PASS 0 NaN 2022-09-11T17:07:55.760Z NaN 1 Punt 2022-09-11T17:05:56.987Z 4.0 1:18 1.0 0.0 0.0 1.0 1.0 -10.0 KICKOFF PUNT 15:00 13:42 NYJ 22 NYJ 36 43.0 172.0 24 9 Home -15 33 -6.5 44.0 0 outdoors fieldturf NaN NaN Robert Saleh John Harbaugh NYC01 MetLife Stadium 0 0.0 J.Flacco 19.0 NaN NaN NaN NaN 1 0 0.0 0 1 00-0026158 NaN NaN J.Flacco 19.0 00-0026158 NaN NaN NaN NaN 0 1 -2.402200 NaN NaN NaN NaN NaN 0.963242 3.675753
7 202 2022_01_BAL_NYJ 2022091107 NYJ BAL REG 1 BAL away NYJ BAL 72.0 2022-09-11 822 1722 3522 Half1 0 2.0 0 1 1.0 0 13:42 BAL 28 10 21.0 (13:42) 8-L.Jackson pass short right to 7-R.Ba... pass 4.0 0 0 1.0 0 0 0 short right -4.0 8.0 NaN NaN NaN NaN NaN NaN 3 3 0.0 NaN NaN NaN NaN 3.0 3.0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.005271 0.136448 0.002322 0.265246 0.223840 0.003272 0.363601 0.0 0.0 0.952560 0.075127 -2.501785 2.501785 1.142888 -1.142888 -2.969520 2.969520 -1.309201 1.384328 -1.309201 1.384328 1.309201 -1.309201 -1.384328 1.384328 -0.855070 0.855070 -2.114451 2.114451 0.495820 0.504180 0.504180 0.495820 0.000774 0.008566 -0.008566 0.503406 0.496594 0.775531 0.224469 0.011235 -0.011235 -0.070925 0.070925 0.0 0.000774 0.0 0.000774 0.0 0.0 -0.000774 0.000774 0.0 0.0 -0.070925 0.070925 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 00-0034796 L.Jackson 4.0 00-0036550 R.Bateman 4.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NYJ NaN 00-0034374 NaN J.Whitehead NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN 0 NaN NaN 0.0 0.0 0.0 0.0 NaN NaN 2022 0.880052 11.994767 3 1 First down 202 9/11/22, 13:05:56 2022-09-11T17:12:11.027Z MetLife Stadium Rain and mid 70s Temp: 73° F, Humidity: 79%, W... 
7ae7de72-d24c-11ec-b23d-d15a91047884 0 0 PASS 0 NaN 2022-09-11T17:12:17.433Z NaN 2 Punt 2022-09-11T17:12:11.027Z 6.0 3:53 1.0 0.0 0.0 1.0 1.0 0.0 PUNT PUNT 13:42 09:49 BAL 28 BAL 49 202.0 368.0 24 9 Home -15 33 -6.5 44.0 0 outdoors fieldturf NaN NaN Robert Saleh John Harbaugh NYC01 MetLife Stadium 0 1.0 L.Jackson 8.0 NaN NaN R.Bateman 7.0 1 0 0.0 0 1 00-0034796 NaN 00-0036550 L.Jackson 8.0 00-0034796 R.Bateman 00-0036550 R.Bateman 00-0036550 1 1 0.075127 1.480030 10.545964 9.0 0.606959 0.241949 0.479318 52.068213
8 230 2022_01_BAL_NYJ 2022091107 NYJ BAL REG 1 BAL away NYJ BAL 68.0 2022-09-11 801 1701 3501 Half1 0 2.0 0 1 2.0 0 13:21 BAL 32 6 21.0 (13:21) (No Huddle, Shotgun) 8-L.Jackson pass ... pass 4.0 1 1 1.0 0 0 0 short left 3.0 1.0 NaN NaN NaN NaN NaN NaN 3 3 0.0 NaN NaN NaN NaN 3.0 3.0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.005439 0.142576 0.002563 0.254452 0.226579 0.003354 0.365037 0.0 0.0 1.027688 -0.105120 -2.396665 2.396665 1.142888 -1.142888 -2.864400 2.864400 -0.469484 0.364364 -0.469484 0.364364 1.778685 -1.778685 -1.748692 1.748692 -0.385585 0.385585 -2.478815 2.478815 0.496594 0.503406 0.503406 0.496594 0.002137 0.002200 -0.002200 0.501268 0.498732 0.784097 0.215903 0.011235 -0.011235 -0.073062 0.073062 0.0 0.002137 0.0 0.002137 0.0 0.0 -0.002912 0.002912 0.0 0.0 -0.073062 0.073062 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 00-0034796 L.Jackson 4.0 00-0036331 D.Duvernay 4.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NYJ NaN 00-0031296 NaN C.Mosley NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN 0 NaN NaN 0.0 0.0 0.0 0.0 NaN NaN 2022 0.761786 23.821402 3 1 First down 230 9/11/22, 13:05:56 2022-09-11T17:12:35.220Z MetLife Stadium Rain and mid 70s Temp: 73° F, Humidity: 79%, W... 
7ae7de72-d24c-11ec-b23d-d15a91047884 0 0 PASS 0 NaN 2022-09-11T17:12:38.357Z NaN 2 Punt 2022-09-11T17:12:11.027Z 6.0 3:53 1.0 0.0 0.0 1.0 1.0 0.0 PUNT PUNT 13:42 09:49 BAL 28 BAL 49 202.0 368.0 24 9 Home -15 33 -6.5 44.0 0 outdoors fieldturf NaN NaN Robert Saleh John Harbaugh NYC01 MetLife Stadium 0 0.0 L.Jackson 8.0 NaN NaN D.Duvernay 13.0 1 0 0.0 0 1 00-0034796 NaN 00-0036331 L.Jackson 8.0 00-0034796 D.Duvernay 00-0036331 D.Duvernay 00-0036331 0 1 -0.105120 0.950097 4.795807 3.0 0.652492 0.514376 0.608057 39.194345
11 301 2022_01_BAL_NYJ 2022091107 NYJ BAL REG 1 BAL away NYJ BAL 60.0 2022-09-11 679 1579 3379 Half1 0 2.0 0 1 2.0 0 11:19 BAL 40 10 21.0 (11:19) (Shotgun) 8-L.Jackson pass short left ... pass 8.0 1 0 1.0 0 0 0 short left 2.0 6.0 NaN NaN NaN NaN NaN NaN 3 3 0.0 NaN NaN NaN NaN 3.0 3.0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.006759 0.125873 0.001952 0.242858 0.240966 0.004833 0.376760 0.0 0.0 1.288348 0.411132 -3.173578 3.173578 0.777107 -0.777107 -3.275532 3.275532 -0.732492 1.143623 -0.732492 1.143623 2.511177 -2.511177 -2.892316 2.892316 0.346907 -0.346907 -3.622438 3.622438 0.506771 0.493229 0.493229 0.506771 0.007203 0.001069 -0.001069 0.486026 0.513974 0.765140 0.234860 0.003196 -0.003196 -0.080266 0.080266 0.0 0.007203 0.0 0.007203 0.0 0.0 -0.010115 0.010115 0.0 0.0 -0.080266 0.080266 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 00-0034796 L.Jackson 8.0 00-0036331 D.Duvernay 8.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NYJ NaN 00-0034384 NaN D.Reed NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN 0 NaN NaN 0.0 0.0 0.0 0.0 NaN NaN 2022 0.798420 20.157987 4 0 Punt 301 9/11/22, 13:05:56 2022-09-11T17:14:36.540Z MetLife Stadium Rain and mid 70s Temp: 73° F, Humidity: 79%, W... 
7ae7de72-d24c-11ec-b23d-d15a91047884 0 0 PASS 0 NaN 2022-09-11T17:14:42.137Z NaN 2 Punt 2022-09-11T17:12:11.027Z 6.0 3:53 1.0 0.0 0.0 1.0 1.0 0.0 PUNT PUNT 13:42 09:49 BAL 28 BAL 49 202.0 368.0 24 9 Home -15 33 -6.5 44.0 0 outdoors fieldturf NaN NaN Robert Saleh John Harbaugh NYC01 MetLife Stadium 0 1.0 L.Jackson 8.0 NaN NaN D.Duvernay 13.0 1 0 0.0 0 1 00-0034796 NaN 00-0036331 L.Jackson 8.0 00-0034796 D.Duvernay 00-0036331 D.Duvernay 00-0036331 1 1 0.411132 0.953877 5.303057 4.0 0.534381 0.221042 0.565180 43.481982
In [9]:
alt.Chart(sacks_in_2022[~sacks_in_2022.down.isna()][["yardline_100", "sack", "down"]]).mark_bar().encode(
    x=alt.X("yardline_100:Q", bin=alt.Bin(maxbins=20)),
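    # mean(sack) is a fraction in [0, 1], so format the axis as a percentage to match its title.
    y=alt.Y("mean(sack):Q", title="Percentage of pass plays resulting in a sack", axis=alt.Axis(format=".0%")).stack(False),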
    color=alt.Color("down:N", title="Down"),
    column=alt.Column("down:N", title="Down"),
    # facet=alt.Facet("defteam:N", columns=8, title="Defensive Team"),
    opacity=alt.value(0.75)
).properties(
    title=alt.Title("Percentage of pass plays resulting in a sack at a given yard line, by down", fontSize=25)
)
Out[9]:

TODO: Expound on the visualization
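The chart above plots the mean of the 0/1 `sack` column within each yard-line bin, separately for each down. A minimal sketch of that aggregation in plain Python is shown here so the numbers behind the bars can be inspected directly. The function name `sack_rate_by_down_and_bin`, the fixed 5-yard bin width, and the toy plays are assumptions for illustration; Altair picks its own bin boundaries from `maxbins=20`.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

BIN_WIDTH = 5  # assumed bin width in yards, not necessarily what Altair chooses


def sack_rate_by_down_and_bin(plays: Iterable[dict]) -> Dict[Tuple[int, int], float]:
    """Mean of the 0/1 sack flag per (down, start of yard-line bin)."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sack count, play count]
    for play in plays:
        bin_start = int(play["yardline_100"] // BIN_WIDTH) * BIN_WIDTH
        key = (int(play["down"]), bin_start)
        totals[key][0] += play["sack"]
        totals[key][1] += 1
    return {key: sacks / count for key, (sacks, count) in totals.items()}


# Made-up pass plays, only to exercise the function.
toy_plays = [
    {"down": 1.0, "yardline_100": 72.0, "sack": 0.0},
    {"down": 1.0, "yardline_100": 74.0, "sack": 1.0},
    {"down": 3.0, "yardline_100": 54.0, "sack": 1.0},
    {"down": 3.0, "yardline_100": 87.0, "sack": 0.0},
]
rates = sack_rate_by_down_and_bin(toy_plays)
for (down, bin_start), rate in sorted(rates.items()):
    print(f"down {down}, yards {bin_start}-{bin_start + BIN_WIDTH}: {rate:.0%}")
```

Bins with very few plays produce noisy rates, so it can help to look at the play count per bin alongside the mean before reading much into any single bar.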

Build Predictive Models

  1. Build an initial model based on these simple circumstantial features
    1. Try logistic regression
    2. Try Naive Bayes
  2. Evaluate the model
  3. Expand on it with more complex input features based on the players' statistics
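To make the first two steps of this plan concrete, here is a small dependency-free sketch of logistic regression fitted by gradient descent and scored with log loss against a base-rate baseline. It is not the notebook's model code. The helper names, the toy features (`down` divided by 4 and `ydstogo` divided by 10), the labels, the learning rate, and the epoch count are all assumptions chosen for illustration. In practice a library implementation would be the more usual choice.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights: Sequence[float], bias: float, x: Sequence[float]) -> float:
    """Predicted sack probability for one feature vector."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


def fit_logistic(
    X: Sequence[Sequence[float]], y: Sequence[int], lr: float = 0.1, epochs: int = 2000
) -> Tuple[List[float], float]:
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            error = predict_proba(weights, bias, xi) - yi
            for j, value in enumerate(xi):
                grad_w[j] += error * value
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def log_loss(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean negative log-likelihood; lower is better."""
    eps = 1e-12  # guard against log(0)
    return -sum(
        yt * math.log(max(p, eps)) + (1 - yt) * math.log(max(1 - p, eps))
        for yt, p in zip(y_true, y_prob)
    ) / len(y_true)


# Toy data (made up): columns are [down / 4, ydstogo / 10], label is 1 for a sack.
X = [[0.25, 1.0], [0.25, 0.5], [0.5, 0.7], [0.5, 0.3],
     [0.75, 1.0], [0.75, 0.8], [0.75, 0.2], [1.0, 0.1]]
y = [0, 0, 0, 0, 1, 1, 0, 0]

weights, bias = fit_logistic(X, y)
train_probs = [predict_proba(weights, bias, xi) for xi in X]
model_loss = log_loss(y, train_probs)

# Baseline: always predict the overall sack rate.
base_rate = sum(y) / len(y)
baseline_loss = log_loss(y, [base_rate] * len(y))
print(f"model log loss: {model_loss:.4f}, baseline log loss: {baseline_loss:.4f}")
```

Because sacks are rare, accuracy is a poor yardstick here: a model that never predicts a sack would look accurate. Comparing log loss on held-out plays against the base-rate baseline is one simple way to check whether the situational features carry any signal. The toy example above scores on its own training data only to keep the sketch short.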
In [10]:
helpful_fields = ["yardline_100", "quarter_seconds_remaining", "qtr", "down", "ydstogo", "sack", "season"]
predictive_fields = ["yardline_100", "quarter_seconds_remaining", "qtr", "down", "ydstogo", "sack"]
  • For now I assume it is already known that the next play will be a pass. A natural extension would be to predict sack probability for any play, since in reality the play type is not known in advance.
  • Two-point conversions are excluded from this first version of the model.
In [11]:
def load_all_season_passing_plays() -> pd.DataFrame:
    """
    Load all passing plays from 2021 to 2023.

    :return: DataFrame containing all passing plays.
    """
    # Read each season once and concatenate in a single call
    # (cheaper than growing a DataFrame inside the loop).
    season_dfs = [
        pd.read_csv(f"../data/play_by_play_{year}.csv", header=0, low_memory=False)
        for year in range(2021, 2024)
    ]
    play_by_play_df = pd.concat(season_dfs, ignore_index=True)

    # Passing plays, excluding 2-point conversions.
    passing_plays_df = play_by_play_df[
        (play_by_play_df.play_type == "pass")
        # EDA showed that pass plays with a null down are 2-point conversion attempts.
        & (~play_by_play_df.down.isna())
    ]
    return passing_plays_df
In [12]:
passing_plays_df = load_all_season_passing_plays()
passing_plays_df.head()
Out[12]:
play_id game_id old_game_id home_team away_team season_type week posteam posteam_type defteam side_of_field yardline_100 game_date quarter_seconds_remaining half_seconds_remaining game_seconds_remaining game_half quarter_end drive sp qtr down goal_to_go time yrdln ydstogo ydsnet desc play_type yards_gained shotgun no_huddle qb_dropback qb_kneel qb_spike qb_scramble pass_length pass_location air_yards yards_after_catch run_location run_gap field_goal_result kick_distance extra_point_result two_point_conv_result home_timeouts_remaining away_timeouts_remaining timeout timeout_team td_team td_player_name td_player_id posteam_timeouts_remaining defteam_timeouts_remaining total_home_score total_away_score posteam_score defteam_score score_differential posteam_score_post defteam_score_post score_differential_post no_score_prob opp_fg_prob opp_safety_prob opp_td_prob fg_prob safety_prob td_prob extra_point_prob two_point_conversion_prob ep epa total_home_epa total_away_epa total_home_rush_epa total_away_rush_epa total_home_pass_epa total_away_pass_epa air_epa yac_epa comp_air_epa comp_yac_epa total_home_comp_air_epa total_away_comp_air_epa total_home_comp_yac_epa total_away_comp_yac_epa total_home_raw_air_epa total_away_raw_air_epa total_home_raw_yac_epa total_away_raw_yac_epa wp def_wp home_wp away_wp wpa vegas_wpa vegas_home_wpa home_wp_post away_wp_post vegas_wp vegas_home_wp total_home_rush_wpa total_away_rush_wpa total_home_pass_wpa total_away_pass_wpa air_wpa yac_wpa comp_air_wpa comp_yac_wpa total_home_comp_air_wpa total_away_comp_air_wpa total_home_comp_yac_wpa total_away_comp_yac_wpa total_home_raw_air_wpa total_away_raw_air_wpa total_home_raw_yac_wpa total_away_raw_yac_wpa punt_blocked first_down_rush first_down_pass first_down_penalty third_down_converted third_down_failed fourth_down_converted fourth_down_failed incomplete_pass touchback interception punt_inside_twenty punt_in_endzone punt_out_of_bounds punt_downed punt_fair_catch kickoff_inside_twenty 
kickoff_in_endzone kickoff_out_of_bounds kickoff_downed kickoff_fair_catch fumble_forced fumble_not_forced fumble_out_of_bounds solo_tackle safety penalty tackled_for_loss fumble_lost own_kickoff_recovery own_kickoff_recovery_td qb_hit rush_attempt pass_attempt sack touchdown pass_touchdown rush_touchdown return_touchdown extra_point_attempt two_point_attempt field_goal_attempt kickoff_attempt punt_attempt fumble complete_pass assist_tackle lateral_reception lateral_rush lateral_return lateral_recovery passer_player_id passer_player_name passing_yards receiver_player_id receiver_player_name receiving_yards rusher_player_id rusher_player_name rushing_yards lateral_receiver_player_id lateral_receiver_player_name lateral_receiving_yards lateral_rusher_player_id lateral_rusher_player_name lateral_rushing_yards lateral_sack_player_id lateral_sack_player_name interception_player_id interception_player_name lateral_interception_player_id lateral_interception_player_name punt_returner_player_id punt_returner_player_name lateral_punt_returner_player_id lateral_punt_returner_player_name kickoff_returner_player_name kickoff_returner_player_id lateral_kickoff_returner_player_id lateral_kickoff_returner_player_name punter_player_id punter_player_name kicker_player_name kicker_player_id own_kickoff_recovery_player_id own_kickoff_recovery_player_name blocked_player_id blocked_player_name tackle_for_loss_1_player_id tackle_for_loss_1_player_name tackle_for_loss_2_player_id tackle_for_loss_2_player_name qb_hit_1_player_id qb_hit_1_player_name qb_hit_2_player_id qb_hit_2_player_name forced_fumble_player_1_team forced_fumble_player_1_player_id forced_fumble_player_1_player_name forced_fumble_player_2_team forced_fumble_player_2_player_id forced_fumble_player_2_player_name solo_tackle_1_team solo_tackle_2_team solo_tackle_1_player_id solo_tackle_2_player_id solo_tackle_1_player_name solo_tackle_2_player_name assist_tackle_1_player_id assist_tackle_1_player_name assist_tackle_1_team 
assist_tackle_2_player_id assist_tackle_2_player_name assist_tackle_2_team assist_tackle_3_player_id assist_tackle_3_player_name assist_tackle_3_team assist_tackle_4_player_id assist_tackle_4_player_name assist_tackle_4_team tackle_with_assist tackle_with_assist_1_player_id tackle_with_assist_1_player_name tackle_with_assist_1_team tackle_with_assist_2_player_id tackle_with_assist_2_player_name tackle_with_assist_2_team pass_defense_1_player_id pass_defense_1_player_name pass_defense_2_player_id pass_defense_2_player_name fumbled_1_team fumbled_1_player_id fumbled_1_player_name fumbled_2_player_id fumbled_2_player_name fumbled_2_team fumble_recovery_1_team fumble_recovery_1_yards fumble_recovery_1_player_id fumble_recovery_1_player_name fumble_recovery_2_team fumble_recovery_2_yards fumble_recovery_2_player_id fumble_recovery_2_player_name sack_player_id sack_player_name half_sack_1_player_id half_sack_1_player_name half_sack_2_player_id half_sack_2_player_name return_team return_yards penalty_team penalty_player_id penalty_player_name penalty_yards replay_or_challenge replay_or_challenge_result penalty_type defensive_two_point_attempt defensive_two_point_conv defensive_extra_point_attempt defensive_extra_point_conv safety_player_name safety_player_id season cp cpoe series series_success series_result order_sequence start_time time_of_day stadium weather nfl_api_id play_clock play_deleted play_type_nfl special_teams_play st_play_type end_clock_time end_yard_line fixed_drive fixed_drive_result drive_real_start_time drive_play_count drive_time_of_possession drive_first_downs drive_inside20 drive_ended_with_score drive_quarter_start drive_quarter_end drive_yards_penalized drive_start_transition drive_end_transition drive_game_clock_start drive_game_clock_end drive_start_yard_line drive_end_yard_line drive_play_id_started drive_play_id_ended away_score home_score location result total spread_line total_line div_game roof surface temp wind home_coach away_coach 
stadium_id game_stadium aborted_play success passer passer_jersey_number rusher rusher_jersey_number receiver receiver_jersey_number pass rush first_down special play passer_id rusher_id receiver_id name jersey_number id fantasy_player_name fantasy_player_id fantasy fantasy_id out_of_bounds home_opening_kickoff qb_epa xyac_epa xyac_mean_yardage xyac_median_yardage xyac_success xyac_fd xpass pass_oe
3 76 2021_01_ARI_TEN 2021091207 TEN ARI REG 1 TEN home ARI TEN 78.0 2021-09-12 863.0 1763.0 3563.0 Half1 0 1.0 0 1 2.0 0 14:23 TEN 22 13 0.0 (14:23) (Shotgun) 17-R.Tannehill pass short mi... pass 3.0 1 0 1.0 0 0 0 short middle 2.0 1.0 NaN NaN NaN NaN NaN NaN 3 3 0.0 NaN NaN NaN NaN 3.0 3.0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.004958 0.203120 0.002649 0.296200 0.167558 0.003786 0.321729 0.0 0.0 0.074293 0.032412 -1.367393 1.367393 -1.399805 1.399805 0.032412 -0.032412 -0.531589 0.564002 -0.531589 0.564002 -0.531589 0.531589 0.564002 -0.564002 -0.531589 0.531589 0.564002 -0.564002 0.520599 0.479401 0.520599 0.479401 -0.022280 0.015242 0.015242 0.498319 0.501681 0.511638 0.511638 -0.025663 0.025663 -0.022280 0.022280 0.0 -0.022280 0.0 -0.022280 0.0 0.0 -0.022280 0.022280 0.0 0.0 -0.022280 0.022280 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 00-0029701 R.Tannehill 3.0 00-0032764 D.Henry 3.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ARI NaN 00-0032129 NaN J.Hicks NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN 0 NaN NaN 0.0 0.0 0.0 0.0 NaN NaN 2021 0.818796 18.120378 1 0 Punt 76 9/12/21, 13:05:55 2021-09-12T17:07:14Z Nissan Stadium Sunny Temp: 78° F, Humidity: 63%, Wind: SSW 6 mph c59f3fe3-b37c-11eb-a824-966776c37c34 0 0 PASS 0 NaN NaN NaN 1 Punt 2021-09-12T17:05:55Z 3.0 1:33 0.0 0.0 0.0 1.0 1.0 0.0 KICKOFF PUNT 15:00 13:27 TEN 25 TEN 25 40.0 122.0 38 13 Home -25 51 2.5 54.0 0 outdoors grass 82.0 8.0 Mike Vrabel Kliff Kingsbury NAS00 Nissan Stadium 0 1.0 R.Tannehill 17.0 NaN NaN D.Henry 22.0 1 0 0.0 0 1 00-0029701 NaN 00-0032764 
R.Tannehill 17.0 00-0029701 D.Henry 00-0032764 D.Henry 00-0032764 0 1 0.032412 1.165133 5.803177 4.0 0.896654 0.125098 0.697346 30.265415
4 100 2021_01_ARI_TEN 2021091207 TEN ARI REG 1 TEN home ARI TEN 75.0 2021-09-12 822.0 1722.0 3522.0 Half1 0 1.0 0 1 3.0 0 13:42 TEN 25 10 0.0 (13:42) (Shotgun) 17-R.Tannehill pass incomple... pass 0.0 1 0 1.0 0 0 0 short right 10.0 NaN NaN NaN NaN NaN NaN NaN 3 3 0.0 NaN NaN NaN NaN 3.0 3.0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.005425 0.191106 0.002233 0.299509 0.178285 0.003580 0.319862 0.0 0.0 0.106705 -1.532898 -2.900290 2.900290 -1.399805 1.399805 -1.500485 1.500485 1.977100 -3.509998 0.000000 0.000000 -0.531589 0.531589 0.564002 -0.564002 1.445511 -1.445511 -2.945996 2.945996 0.498319 0.501681 0.498319 0.501681 -0.036612 -0.051595 -0.051595 0.461707 0.538293 0.526880 0.526880 -0.025663 0.025663 -0.058892 0.058892 0.0 -0.036612 0.0 0.000000 0.0 0.0 -0.022280 0.022280 0.0 0.0 -0.058892 0.058892 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 00-0029701 R.Tannehill NaN 00-0032355 C.Rogers NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN 00-0035236 B.Murphy NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN 0 NaN NaN 0.0 0.0 0.0 0.0 NaN NaN 2021 0.519577 -51.957655 1 0 Punt 100 9/12/21, 13:05:55 2021-09-12T17:07:54Z Nissan Stadium Sunny Temp: 78° F, Humidity: 63%, Wind: SSW 6 mph c59f3fe3-b37c-11eb-a824-966776c37c34 0 0 PASS 0 NaN NaN NaN 1 Punt 2021-09-12T17:05:55Z 3.0 1:33 0.0 0.0 0.0 1.0 1.0 0.0 KICKOFF PUNT 15:00 13:27 TEN 25 TEN 25 40.0 122.0 38 13 Home -25 51 2.5 54.0 0 outdoors grass 82.0 8.0 Mike Vrabel Kliff Kingsbury NAS00 Nissan Stadium 0 0.0 R.Tannehill 17.0 NaN NaN C.Rogers 80.0 1 0 0.0 0 1 00-0029701 NaN 00-0032355 
R.Tannehill 17.0 00-0029701 C.Rogers 00-0032355 C.Rogers 00-0032355 0 1 -1.532898 0.256036 4.147637 2.0 0.965009 0.965009 0.978253 2.174652
6 152 2021_01_ARI_TEN 2021091207 TEN ARI REG 1 ARI away TEN ARI 61.0 2021-09-12 807.0 1707.0 3507.0 Half1 0 2.0 0 1 1.0 0 13:27 ARI 39 10 45.0 (13:27) (Shotgun) 1-K.Murray pass deep left to... pass 38.0 1 0 1.0 0 0 0 deep left 29.0 9.0 NaN NaN NaN NaN NaN NaN 3 3 0.0 NaN NaN NaN NaN 3.0 3.0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.005188 0.113161 0.001777 0.213975 0.257016 0.004175 0.404707 0.0 0.0 1.771485 2.692890 -5.938473 5.938473 -1.399805 1.399805 -4.193375 4.193375 2.031169 0.661721 2.031169 0.661721 -2.562759 2.562759 -0.097719 0.097719 -0.585658 0.585658 -3.607717 3.607717 0.522434 0.477566 0.477566 0.522434 0.094349 0.076450 -0.076450 0.383217 0.616783 0.526610 0.473390 -0.025663 0.025663 -0.153241 0.153241 0.0 0.094349 0.0 0.094349 0.0 0.0 -0.116629 0.116629 0.0 0.0 -0.153241 0.153241 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 00-0035228 K.Murray 38.0 00-0030564 D.Hopkins 38.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN TEN NaN 00-0029681 NaN J.Jenkins NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN 0 NaN NaN 0.0 0.0 0.0 0.0 NaN NaN 2021 0.349448 65.055165 2 1 First down 152 9/12/21, 13:05:55 2021-09-12T17:09:19Z Nissan Stadium Sunny Temp: 78° F, Humidity: 63%, Wind: SSW 6 mph c59f3fe3-b37c-11eb-a824-966776c37c34 0 0 PASS 0 NaN NaN NaN 2 Field goal 2021-09-12T17:09:19Z 8.0 4:05 2.0 1.0 1.0 1.0 1.0 -25.0 PUNT FIELD_GOAL 13:27 09:22 ARI 39 TEN 16 152.0 432.0 38 13 Home -25 51 2.5 54.0 0 outdoors grass 82.0 8.0 Mike Vrabel Kliff Kingsbury NAS00 Nissan Stadium 0 1.0 K.Murray 1.0 NaN NaN D.Hopkins 10.0 1 0 1.0 0 1 00-0035228 NaN 
00-0030564 K.Murray 1.0 00-0035228 D.Hopkins 00-0030564 D.Hopkins 00-0030564 1 1 2.692890 0.567838 7.420427 4.0 1.000000 1.000000 0.458989 54.101130
8 218 2021_01_ARI_TEN 2021091207 TEN ARI REG 1 ARI away TEN TEN 31.0 2021-09-12 746.0 1646.0 3446.0 Half1 0 2.0 0 1 1.0 0 12:26 TEN 31 18 45.0 (12:26) (Shotgun) 1-K.Murray pass short left t... pass 1.0 1 0 1.0 0 0 0 short left -4.0 5.0 NaN NaN NaN NaN NaN NaN 3 3 0.0 NaN NaN NaN NaN 3.0 3.0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.002672 0.050829 0.000771 0.073216 0.495410 0.000989 0.376114 0.0 0.0 3.454468 -0.511090 -4.417475 4.417475 -0.389897 0.389897 -3.682285 3.682285 -0.809699 0.298609 -0.809699 0.298609 -1.753060 1.753060 -0.396328 0.396328 0.224041 -0.224041 -3.906326 3.906326 0.591982 0.408018 0.408018 0.591982 -0.011362 -0.024915 0.024915 0.419380 0.580620 0.573294 0.426706 -0.000862 0.000862 -0.141879 0.141879 0.0 -0.011362 0.0 -0.011362 0.0 0.0 -0.105267 0.105267 0.0 0.0 -0.141879 0.141879 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 00-0035228 K.Murray 1.0 00-0034681 C.Edmonds 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN TEN NaN 00-0034828 NaN H.Landry NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN 0 NaN NaN 0.0 0.0 0.0 0.0 NaN NaN 2021 0.888901 11.109942 3 1 First down 218 9/12/21, 13:05:55 2021-09-12T17:11:04Z Nissan Stadium Sunny Temp: 78° F, Humidity: 63%, Wind: SSW 6 mph c59f3fe3-b37c-11eb-a824-966776c37c34 0 0 PASS 0 NaN NaN NaN 2 Field goal 2021-09-12T17:09:19Z 8.0 4:05 2.0 1.0 1.0 1.0 1.0 -25.0 PUNT FIELD_GOAL 13:27 09:22 ARI 39 TEN 16 152.0 432.0 38 13 Home -25 51 2.5 54.0 0 outdoors grass 82.0 8.0 Mike Vrabel Kliff Kingsbury NAS00 Nissan Stadium 0 0.0 K.Murray 1.0 NaN NaN C.Edmonds 2.0 1 0 0.0 0 1 00-0035228 NaN 
00-0034681 K.Murray 1.0 00-0035228 C.Edmonds 00-0034681 C.Edmonds 00-0034681 1 1 -0.511090 1.036891 10.339405 9.0 0.478471 0.079696 0.684949 31.505138
9 253 2021_01_ARI_TEN 2021091207 TEN ARI REG 1 ARI away TEN TEN 30.0 2021-09-12 714.0 1614.0 3414.0 Half1 0 2.0 0 1 2.0 0 11:54 TEN 30 17 45.0 (11:54) (Shotgun) 1-K.Murray pass deep right t... pass 21.0 1 0 1.0 0 0 0 deep right 20.0 1.0 NaN NaN NaN NaN NaN NaN 3 3 0.0 NaN NaN NaN NaN 3.0 3.0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.003365 0.056816 0.000692 0.086453 0.560816 0.001014 0.290844 0.0 0.0 2.943378 2.182015 -6.599490 6.599490 -0.389897 0.389897 -5.864300 5.864300 2.086775 0.095240 2.086775 0.095240 -3.839835 3.839835 -0.491568 0.491568 -1.862735 1.862735 -4.001566 4.001566 0.580620 0.419380 0.419380 0.580620 0.049807 0.078276 -0.078276 0.369573 0.630427 0.548379 0.451621 -0.000862 0.000862 -0.191685 0.191685 0.0 0.049807 0.0 0.049807 0.0 0.0 -0.155073 0.155073 0.0 0.0 -0.191685 0.191685 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 00-0035228 K.Murray 21.0 00-0027942 A.Green 21.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN TEN NaN 00-0035632 NaN A.Hooker NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN 0 NaN NaN 0.0 0.0 0.0 0.0 NaN NaN 2021 0.407577 59.242308 3 1 First down 253 9/12/21, 13:05:55 2021-09-12T17:12:18Z Nissan Stadium Sunny Temp: 78° F, Humidity: 63%, Wind: SSW 6 mph c59f3fe3-b37c-11eb-a824-966776c37c34 0 0 PASS 0 NaN NaN NaN 2 Field goal 2021-09-12T17:09:19Z 8.0 4:05 2.0 1.0 1.0 1.0 1.0 -25.0 PUNT FIELD_GOAL 13:27 09:22 ARI 39 TEN 16 152.0 432.0 38 13 Home -25 51 2.5 54.0 0 outdoors grass 82.0 8.0 Mike Vrabel Kliff Kingsbury NAS00 Nissan Stadium 0 1.0 K.Murray 1.0 NaN NaN A.Green 18.0 1 0 1.0 0 1 00-0035228 NaN 
00-0027942 K.Murray 1.0 00-0035228 A.Green 00-0027942 A.Green 00-0027942 1 1 2.182015 0.517965 3.045047 1.0 1.000000 0.998799 0.775463 22.453719
In [13]:
alt.Chart(passing_plays_df[helpful_fields]).mark_bar().encode(
    x=alt.X("yardline_100:Q", bin=alt.Bin(maxbins=20)),
    y=alt.Y("mean(sack):Q", title="Percentage of pass plays resulting in a sack").stack(False),
    color=alt.Color("down:N", title="Down"),
    column=alt.Column("down:N", title="Down"),
    row=alt.Row("season:O", title="Season"),
    # facet=alt.Facet("defteam:N", columns=8, title="Defensive Team"),
    opacity=alt.value(0.75)
).properties(
    title=alt.Title("Percentage of pass plays resulting in a sack at a given yard line in a given season", fontSize=25)
)
Out[13]:
In [14]:
alt.Chart(passing_plays_df[predictive_fields]).mark_bar().encode(
    x=alt.X("yardline_100:Q", bin=alt.Bin(maxbins=20)),
    y=alt.Y("mean(sack):Q", title="Percentage of pass plays resulting in a sack").stack(False),
    color=alt.Color("down:N", title="Down"),
    column=alt.Column("down:N", title="Down"),
    # facet=alt.Facet("defteam:N", columns=8, title="Defensive Team"),
    opacity=alt.value(0.75)
).properties(
    title=alt.Title("Percentage of pass plays resulting in a sack at a given yard line (2021-2023)", fontSize=25)
)
Out[14]:
In [15]:
def prepare_data_for_training(
    passing_plays_df: pd.DataFrame,
    predictive_fields: list[str],
    fields_to_encode: list[str],
    do_standard_scale: bool = True,
    label_field: str = "sack",
) -> pd.DataFrame:
    """
    Prepare the passing plays DataFrame for training by selecting predictive fields,
    encoding categorical fields, and optionally standard scaling the data.

    :param passing_plays_df: DataFrame containing passing plays data.
    :param predictive_fields: List of fields to use as predictors.
    :param fields_to_encode: List of fields to encode using one-hot encoding.
    :param do_standard_scale: Whether to standard scale the data.
    :param label_field: Name of the label column, which is excluded from scaling.

    :return: Prepared DataFrame ready for training.
    """
    if not set(fields_to_encode).issubset(set(predictive_fields)):
        raise ValueError(
            f"Fields to encode {fields_to_encode} must be a subset of predictive fields {predictive_fields}"
        )
    
    passing_plays_subset_df = passing_plays_df[predictive_fields]
    passing_plays_subset_df = passing_plays_subset_df.astype({"down": int})  # hard coded for now, fix later
    passing_plays_subset_df = pd.get_dummies(
        passing_plays_subset_df,
        columns=fields_to_encode,
        dtype=int
    )

    if do_standard_scale:
        temp_df = passing_plays_subset_df.copy()
        temp_df = temp_df.drop(columns=[label_field])

        scaler = StandardScaler()
        scaled_data = scaler.fit_transform(temp_df)

        temp_df = pd.DataFrame(
            scaled_data,
            columns=temp_df.columns,
        )
        temp_df[label_field] = passing_plays_subset_df[label_field].values
        
        passing_plays_subset_df = temp_df.copy()

    return passing_plays_subset_df


def get_training_test_sets(
    prepared_df: pd.DataFrame,
) -> tuple[pd.DataFrame, pd.Series, pd.DataFrame, pd.Series]:
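    """
    Split the prepared DataFrame into training and test sets (80/20, fixed random state).

    :param prepared_df: Prepared DataFrame that includes the "sack" label column.

    :return: Tuple of (x_train, y_train, x_test, y_test). Note that this order
        differs from the one returned by train_test_split.
    """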
    x_train, x_test, y_train, y_test = train_test_split(
        prepared_df.drop(columns=["sack"]),
        prepared_df["sack"],
        test_size=0.2,
        random_state=42
    )
    return x_train, y_train, x_test, y_test


def build_model_record(
    model_id: int,
    model_name: str,
    model: object,
    x_test: pd.DataFrame,
    y_test: pd.Series,
    desc: Optional[str] = None,
    standard_scaled: bool = False
) -> dict:
    """
    Build a record of the model's performance metrics.

    :param model_id: Integer identifier of the model record.
    :param model_name: Name of the model.
    :param model: The trained model object.
    :param x_test: Test features DataFrame.
    :param y_test: Test labels Series.
    :param desc: Optional description of the model.
    :param standard_scaled: Whether the data was standard scaled.

    :return: Dictionary containing model performance metrics.
    """
    y_predict = model.predict(x_test)
    accuracy_curr = accuracy_score(y_test, y_predict)
    precision_curr = precision_score(y_test, y_predict, zero_division=0)
    recall_curr = recall_score(y_test, y_predict, zero_division=0)
    f1_curr = f1_score(y_test, y_predict, zero_division=0)

    if hasattr(model, 'class_weight'):
        class_weighting = bool(model.class_weight)
    elif hasattr(model, 'scale_pos_weight'):
        class_weighting = bool(model.scale_pos_weight)
    else:
        # Only GaussianNB should get here; it is assumed to have been fit with sample weights
        class_weighting = True
    
    model_record = {
        "model_id": model_id,
        "model": model_name,
        "desc": desc,
        "accuracy": accuracy_curr,
        "precision": precision_curr,
        "recall": recall_curr,
        "f1_score": f1_curr,
        "standard_scaled": standard_scaled,
        "class_weighting": class_weighting,
    }
    return model_record
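One caveat with prepare_data_for_training: when do_standard_scale=True, the scaler is fit on every row before the train/test split, so test-set statistics influence the training features. A minimal sketch of one way to avoid this, assuming a scikit-learn Pipeline is acceptable here, is shown below. The toy_df frame and all names in it are synthetic stand-ins and not the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared passing plays (hypothetical values)
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "yardline_100": rng.integers(1, 100, size=500).astype(float),
    "ydstogo": rng.integers(1, 20, size=500),
    "sack": (rng.random(500) < 0.1).astype(float),
})

x_tr, x_te, y_tr, y_te = train_test_split(
    toy_df.drop(columns=["sack"]),
    toy_df["sack"],
    test_size=0.2,
    random_state=42,
)

# The pipeline fits the scaler on the training rows only, then reuses
# those training statistics when transforming the test rows.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(x_tr, y_tr)

train_means = scaled_lr.named_steps["standardscaler"].mean_
test_probas = scaled_lr.predict_proba(x_te)
```

With this layout the scaling step travels with the model, so the same object can be passed to build_model_record with unscaled test features.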
In [16]:
# passing_plays_df should already be loaded
# passing_plays_df = load_all_season_passing_plays()
prepared_df = prepare_data_for_training(
    passing_plays_df=passing_plays_df,
    predictive_fields=predictive_fields,
    fields_to_encode=["down", "qtr"],
    do_standard_scale=False,
)
prepared_df.head()
Out[16]:
yardline_100 quarter_seconds_remaining ydstogo sack down_1 down_2 down_3 down_4 qtr_1 qtr_2 qtr_3 qtr_4 qtr_5
3 78.0 863.0 13 0.0 0 1 0 0 1 0 0 0 0
4 75.0 822.0 10 0.0 0 0 1 0 1 0 0 0 0
6 61.0 807.0 10 0.0 1 0 0 0 1 0 0 0 0
8 31.0 746.0 18 0.0 1 0 0 0 1 0 0 0 0
9 30.0 714.0 17 0.0 0 1 0 0 1 0 0 0 0
In [17]:
x_train, y_train, x_test, y_test = get_training_test_sets(prepared_df)
In [18]:
lr_model = LogisticRegression(
    penalty="l2",
    solver="lbfgs",
    max_iter=1000,
    random_state=42,
)
lr_model.fit(x_train, y_train)
Out[18]:
LogisticRegression(max_iter=1000, random_state=42)
In [19]:
model_performance_df = pd.DataFrame({
    "model_id": pd.Series(dtype="int"),
    "model": pd.Series(dtype="object"),
    "desc": pd.Series(dtype="object"),
    "accuracy": pd.Series(dtype="float"),
    "precision": pd.Series(dtype="float"),
    "recall": pd.Series(dtype="float"),
    "f1_score": pd.Series(dtype="float"),
    "standard_scaled": pd.Series(dtype="bool"),
    "class_weighting": pd.Series(dtype="bool"),
})
In [20]:
def record_model_results(
    model_performance_df: pd.DataFrame,
    model_name: str,
    model: object,
    x_test: pd.DataFrame,
    y_test: pd.Series,
    desc: Optional[str] = None,
    standard_scaled: bool = False
) -> pd.DataFrame:
    """
    Record the results of a model's performance and update the model performance DataFrame.

    :param model_performance_df: DataFrame to store model performance records.
    :param model_name: Name of the model.
    :param model: The trained model object.
    :param x_test: Test features DataFrame.
    :param y_test: Test labels Series.
    :param desc: Optional description of the model.
    :param standard_scaled: Whether the data was standard scaled.
    
    :return: Updated model performance DataFrame with the new model record.
    """
    model_id = len(model_performance_df)
    model_record = build_model_record(
        model_id=model_id,
        model_name=model_name,
        model=model,
        x_test=x_test,
        y_test=y_test,
        desc=desc,
        standard_scaled=standard_scaled
    )
    new_rows = [model_record]
    new_df = pd.DataFrame(new_rows)
    model_performance_df = pd.concat([model_performance_df, new_df], ignore_index=True)
    return model_performance_df
In [21]:
model_performance_df = record_model_results(
    model_performance_df=model_performance_df,
    model_name="Base Logistic Regression",
    model=lr_model,
    x_test=x_test,
    y_test=y_test,
    desc=f"Logistic Regression model trained on {', '.join(lr_model.feature_names_in_)} features with no standard scaling.",
    standard_scaled=False
)
model_performance_df
Out[21]:
model_id model desc accuracy precision recall f1_score standard_scaled class_weighting
0 0 Base Logistic Regression Logistic Regression model trained on yardline_... 0.936939 0.0 0.0 0.0 False False
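An accuracy of 0.937 together with zero precision and recall suggests the model predicts "no sack" for every play. A minimal sketch of that baseline on synthetic labels is shown below. The 63-in-1000 positive rate is an assumption chosen to roughly match the table above, and y_toy and x_toy are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 63 positives out of 1000 (assumed rate, not the real data)
y_toy = np.zeros(1000, dtype=int)
y_toy[:63] = 1
x_toy = np.zeros((1000, 1))  # the dummy model ignores the features

# Always predict the most frequent class, i.e. "no sack"
baseline = DummyClassifier(strategy="most_frequent").fit(x_toy, y_toy)
y_hat = baseline.predict(x_toy)

baseline_accuracy = accuracy_score(y_toy, y_hat)
baseline_recall = recall_score(y_toy, y_hat, zero_division=0)
print(baseline_accuracy, baseline_recall)
```

Any model that does not beat this baseline on recall has learned nothing useful about sacks, whatever its accuracy.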
In [22]:
prepared_scaled_df = prepare_data_for_training(
    passing_plays_df=passing_plays_df,
    predictive_fields=predictive_fields,
    fields_to_encode=["down", "qtr"],
    do_standard_scale=True,
)
x_train, y_train, x_test, y_test = get_training_test_sets(prepared_scaled_df)
lr_model_2 = LogisticRegression(
    penalty="l2",
    solver="lbfgs",
    max_iter=1000,
    random_state=42,
)
lr_model_2.fit(x_train, y_train)
Out[22]:
LogisticRegression(max_iter=1000, random_state=42)
In [23]:
model_performance_df = record_model_results(
    model_performance_df=model_performance_df,
    model_name="Base Logistic Regression",
    model=lr_model_2,
    x_test=x_test,
    y_test=y_test,
    desc=f"Logistic Regression model trained on {', '.join(lr_model_2.feature_names_in_)} features with a standard scaler applied.",
    standard_scaled=True
)
model_performance_df
Out[23]:
model_id model desc accuracy precision recall f1_score standard_scaled class_weighting
0 0 Base Logistic Regression Logistic Regression model trained on yardline_... 0.936939 0.0 0.0 0.0 False False
1 1 Base Logistic Regression Logistic Regression model trained on yardline_... 0.936939 0.0 0.0 0.0 True False
In [24]:
class_weight = y_train.mean()

lr_model_3 = LogisticRegression(
    penalty="l2",
    solver="lbfgs",
    max_iter=1000,
    random_state=42,
    class_weight={1.0: 1 - class_weight, 0.0: class_weight},
)
lr_model_3.fit(x_train, y_train)

model_performance_df = record_model_results(
    model_performance_df=model_performance_df,
    model_name="Base Logistic Regression",
    model=lr_model_3,
    x_test=x_test,
    y_test=y_test,
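    desc=f"Logistic Regression model trained on {', '.join(lr_model_3.feature_names_in_)} features with a standard scaler and class weighting applied.",
    # NOTE: x_train/x_test still come from prepared_scaled_df (In [22]), so this flag arguably should be True
    standard_scaled=False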
)
model_performance_df
Out[24]:
model_id model desc accuracy precision recall f1_score standard_scaled class_weighting
0 0 Base Logistic Regression Logistic Regression model trained on yardline_... 0.936939 0.000000 0.000000 0.000000 False False
1 1 Base Logistic Regression Logistic Regression model trained on yardline_... 0.936939 0.000000 0.000000 0.000000 True False
2 2 Base Logistic Regression Logistic Regression model trained on yardline_... 0.686717 0.092227 0.448718 0.153005 False True
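The manual weights {1: 1 - p, 0: p} used above give the same relative weighting of the two classes as scikit-learn's built-in "balanced" heuristic; only the absolute scale differs. A small sketch on synthetic labels, where y_toy is a hypothetical stand-in for y_train:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels with roughly the positive rate seen in the notebook (assumed)
y_toy = np.array([0.0] * 937 + [1.0] * 63)
p = y_toy.mean()

# Weights in the style of the notebook: the rare class gets the large weight
manual = {0.0: p, 1.0: 1 - p}

# scikit-learn's heuristic: n_samples / (n_classes * count_of_class)
balanced = compute_class_weight(
    class_weight="balanced",
    classes=np.array([0.0, 1.0]),
    y=y_toy,
)

manual_ratio = manual[1.0] / manual[0.0]
balanced_ratio = balanced[1] / balanced[0]
print(manual_ratio, balanced_ratio)
```

Because the manual weights are all below 1, the data term of the loss is scaled down relative to the L2 penalty, so results from class_weight="balanced" may differ slightly even though the class ratio is identical.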
In [25]:
# passing_plays_df.game_date = pd.to_datetime(passing_plays_df.game_date, format="%Y-%m-%d")
# passing_plays_df["game_date"]
In [26]:
predictive_fields = ["yardline_100", "quarter_seconds_remaining", "qtr", "down", "ydstogo", "sack"]
extended_predictive_fields = predictive_fields + ["defteam", "posteam"]

prepared_extended_df = prepare_data_for_training(
    passing_plays_df=passing_plays_df,
    predictive_fields=extended_predictive_fields,
    fields_to_encode=["down", "qtr", "defteam", "posteam"],
    do_standard_scale=False,
)
prepared_extended_df.head()
Out[26]:
yardline_100 quarter_seconds_remaining ydstogo sack down_1 down_2 down_3 down_4 qtr_1 qtr_2 qtr_3 qtr_4 qtr_5 defteam_ARI defteam_ATL defteam_BAL defteam_BUF defteam_CAR defteam_CHI defteam_CIN defteam_CLE defteam_DAL defteam_DEN defteam_DET defteam_GB defteam_HOU defteam_IND defteam_JAX defteam_KC defteam_LA defteam_LAC defteam_LV defteam_MIA defteam_MIN defteam_NE defteam_NO defteam_NYG defteam_NYJ defteam_PHI defteam_PIT defteam_SEA defteam_SF defteam_TB defteam_TEN defteam_WAS posteam_ARI posteam_ATL posteam_BAL posteam_BUF posteam_CAR posteam_CHI posteam_CIN posteam_CLE posteam_DAL posteam_DEN posteam_DET posteam_GB posteam_HOU posteam_IND posteam_JAX posteam_KC posteam_LA posteam_LAC posteam_LV posteam_MIA posteam_MIN posteam_NE posteam_NO posteam_NYG posteam_NYJ posteam_PHI posteam_PIT posteam_SEA posteam_SF posteam_TB posteam_TEN posteam_WAS
3 78.0 863.0 13 0.0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
4 75.0 822.0 10 0.0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
6 61.0 807.0 10 0.0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
8 31.0 746.0 18 0.0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9 30.0 714.0 17 0.0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
In [27]:
x_train, y_train, x_test, y_test = get_training_test_sets(prepared_extended_df)
lr_model_4 = LogisticRegression(
    penalty="l2",
    solver="lbfgs",
    max_iter=1000,
    random_state=42,
)
lr_model_4.fit(x_train, y_train)

model_performance_df = record_model_results(
    model_performance_df=model_performance_df,
    model_name="Extended Logistic Regression",
    model=lr_model_4,
    x_test=x_test,
    y_test=y_test,
    desc=f"Logistic Regression model trained on {', '.join(lr_model_4.feature_names_in_)} features with no standard scaling.",
    standard_scaled=False
)
model_performance_df
Out[27]:
model_id model desc accuracy precision recall f1_score standard_scaled class_weighting
0 0 Base Logistic Regression Logistic Regression model trained on yardline_... 0.936939 0.000000 0.000000 0.000000 False False
1 1 Base Logistic Regression Logistic Regression model trained on yardline_... 0.936939 0.000000 0.000000 0.000000 True False
2 2 Base Logistic Regression Logistic Regression model trained on yardline_... 0.686717 0.092227 0.448718 0.153005 False True
3 3 Extended Logistic Regression Logistic Regression model trained on yardline_... 0.936939 0.000000 0.000000 0.000000 False False
In [28]:
lr_model_5 = LogisticRegression(
    penalty="l2",
    solver="lbfgs",
    max_iter=1000,
    random_state=42,
    class_weight={1.0: 1 - class_weight, 0.0: class_weight},
)
lr_model_5.fit(x_train, y_train)

model_performance_df = record_model_results(
    model_performance_df=model_performance_df,
    model_name="Extended Logistic Regression",
    model=lr_model_5,
    x_test=x_test,
    y_test=y_test,
    desc=f"Logistic Regression model trained on {', '.join(lr_model_5.feature_names_in_)} features with class weighting and no standard scaling.",
    standard_scaled=False
)
model_performance_df
Out[28]:
model_id model desc accuracy precision recall f1_score standard_scaled class_weighting
0 0 Base Logistic Regression Logistic Regression model trained on yardline_... 0.936939 0.000000 0.000000 0.000000 False False
1 1 Base Logistic Regression Logistic Regression model trained on yardline_... 0.936939 0.000000 0.000000 0.000000 True False
2 2 Base Logistic Regression Logistic Regression model trained on yardline_... 0.686717 0.092227 0.448718 0.153005 False True
3 3 Extended Logistic Regression Logistic Regression model trained on yardline_... 0.936939 0.000000 0.000000 0.000000 False False
4 4 Extended Logistic Regression Logistic Regression model trained on yardline_... 0.612661 0.089121 0.557692 0.153683 False True
In [29]:
model_performance_df["class_weighting"] = pd.Series([False, False, True, False, True], dtype="bool")
model_performance_df
Out[29]:
model_id model desc accuracy precision recall f1_score standard_scaled class_weighting
0 0 Base Logistic Regression Logistic Regression model trained on yardline_... 0.936939 0.000000 0.000000 0.000000 False False
1 1 Base Logistic Regression Logistic Regression model trained on yardline_... 0.936939 0.000000 0.000000 0.000000 True False
2 2 Base Logistic Regression Logistic Regression model trained on yardline_... 0.686717 0.092227 0.448718 0.153005 False True
3 3 Extended Logistic Regression Logistic Regression model trained on yardline_... 0.936939 0.000000 0.000000 0.000000 False False
4 4 Extended Logistic Regression Logistic Regression model trained on yardline_... 0.612661 0.089121 0.557692 0.153683 False True
In [30]:
lr_model.class_weight
In [31]:
fig, ax = plt.subplots(figsize=(5, 5), dpi=160)

cm = confusion_matrix(y_test, lr_model_5.predict(x_test), labels=[0, 1])
ConfusionMatrixDisplay(cm).plot(colorbar=False, ax=ax)
plt.title("Confusion Matrix for Logistic\nRegression Model using class weighting", fontsize=16)
plt.show()
[Figure: confusion matrix of lr_model_5 (class-weighted logistic regression) predictions on the test set]
In [32]:
(lr_model_5.predict_proba(x_test).argmax(axis=1) == lr_model_5.predict(x_test)).all()
Out[32]:
np.True_

The cell above confirms that, for this binary problem, predict returns whichever class has the higher predicted probability, which is equivalent to thresholding the predicted sack probability at 0.5.
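Since predict is just a 0.5 cut on predict_proba, the cut can be moved to trade precision for recall. A minimal sketch on synthetic data follows; make_classification, the 0.2 threshold, and every name here are illustrative assumptions rather than choices made in the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic imbalanced problem (about 10% positives), not the NFL data
x_toy, y_toy = make_classification(
    n_samples=2000, n_features=5, weights=[0.9, 0.1], random_state=0
)
toy_model = LogisticRegression(max_iter=1000).fit(x_toy, y_toy)

# Probability of the positive class for each row
sack_proba = toy_model.predict_proba(x_toy)[:, 1]

default_pred = (sack_proba > 0.5).astype(int)  # same decision rule as predict
low_pred = (sack_proba > 0.2).astype(int)      # hypothetical lower threshold

default_recall = recall_score(y_toy, default_pred, zero_division=0)
low_recall = recall_score(y_toy, low_pred, zero_division=0)
print(default_recall, low_recall)
```

Lowering the threshold can only add predicted positives, so recall never decreases, while precision usually does.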

In [33]:
TP = 435
FN = 345
FP = 4446

TP / (TP + FN), TP / (TP + FP)  # Recall and Precision respectively
Out[33]:
(0.5576923076923077, 0.08912108174554395)
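The counts above can also be read directly from a confusion matrix rather than typed in by hand. The sketch below rebuilds label arrays from those counts and recovers recall, precision, and F1. The true-negative count tn_n is a placeholder assumption, since it does not enter any of the three metrics.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Counts taken from the cell above; tn_n is a hypothetical placeholder
tp_n, fn_n, fp_n, tn_n = 435, 345, 4446, 100

# Rebuild label arrays that reproduce exactly these counts
y_true = np.array([1] * tp_n + [1] * fn_n + [0] * fp_n + [0] * tn_n)
y_pred = np.array([1] * tp_n + [0] * fn_n + [1] * fp_n + [0] * tn_n)

# With labels=[0, 1], ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)
print(recall, precision, f1)
```

The resulting F1 agrees with the f1_score column reported for model 4 in the performance table.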

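Why we want to maximize recall in our case:¶

  • We have imbalanced classes: a model that never predicts a sack already reaches 0.937 accuracy on the test set, because only about 6% of pass plays end in a sack. This causes problems for many machine learning algorithms, which typically expect to see approximately the same number of instances of each class in a classification problem. Recall on the sack class measures how many actual sacks the model catches, which accuracy alone hides.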
In [34]:
probas = lr_model_4.predict_proba(x_test)[:,1]
random_nums = np.random.rand(len(probas))
predictions = (probas > random_nums).astype(int)
predictions.sum()
Out[34]:
np.int64(893)
In [35]:
probas
Out[35]:
array([0.10554114, 0.03422494, 0.03551765, ..., 0.06310447, 0.11625848,
       0.04925789])
In [36]:
random_nums
Out[36]:
array([0.04442786, 0.69004933, 0.7281382 , ..., 0.30958478, 0.76881682,
       0.43656398])
In [37]:
predictions
Out[37]:
array([1, 0, 0, ..., 0, 0, 0])
In [38]:
recall_score(y_test, predictions, zero_division=0), precision_score(y_test, predictions, zero_division=0)
Out[38]:
(0.09871794871794871, 0.08622620380739082)
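Drawing each prediction as a coin flip with the model's probability means the expected number of predicted sacks equals the sum of the probabilities, and each individual play is flagged only rarely. That is consistent with the low recall above. The sketch below illustrates this with synthetic probabilities; toy_probas is a hypothetical stand-in for predict_proba(x_test)[:, 1], and a seeded generator is used so the draw is reproducible, unlike np.random.rand in the cell above.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded so the draw is reproducible

# Synthetic probabilities in a range similar to the ones printed above
toy_probas = rng.uniform(0.02, 0.12, size=10_000)

# One Bernoulli draw per play: predict a sack with probability toy_probas[i]
sampled = (toy_probas > rng.random(toy_probas.size)).astype(int)

expected_positives = toy_probas.sum()
observed_positives = sampled.sum()
print(expected_positives, observed_positives)
```

Sampling therefore reproduces the overall sack frequency but does not concentrate predictions on the plays most likely to end in a sack.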
In [39]:
fig, ax = plt.subplots(figsize=(5, 5), dpi=160)

cm = confusion_matrix(y_test, predictions, labels=[0, 1])
ConfusionMatrixDisplay(cm).plot(colorbar=False, ax=ax)
plt.title("Confusion Matrix for Logistic Regression\nModel using sampled predictions (no class weighting)", fontsize=16)
plt.show()
[Figure: confusion matrix of the randomly sampled lr_model_4 predictions on the test set]
In [40]:
x_train, y_train, x_test, y_test = get_training_test_sets(prepared_scaled_df)

gauss_naive_bayes = GaussianNB()
sample_weight = compute_sample_weight(
    class_weight={
        0: y_train.mean(),
        1: 1 - y_train.mean()
    },
    y=y_train,
)
gauss_naive_bayes.fit(x_train, y_train, sample_weight=sample_weight)

# Predict once and reuse the result for both metrics
nb_predictions = gauss_naive_bayes.predict(x_test)
precision_score(y_test, nb_predictions, labels=[0, 1], zero_division=0), recall_score(y_test, nb_predictions, labels=[0, 1], zero_division=0)
Out[40]:
(0.09355950172276703, 0.45256410256410257)
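The weighting used above gives each class the frequency of the other class as its weight. Here is a minimal sketch of what that does, assuming (as I understand `compute_sample_weight` with a dict) that every sample simply receives the weight of its class. The helper and the labels are hypothetical.

```python
def sample_weights_from_class_weights(y, class_weight):
    """Give every sample the weight assigned to its class."""
    return [class_weight[label] for label in y]


# Hypothetical training labels: 7 sacks out of 100 pass plays
y_train = [1] * 7 + [0] * 93
positive_rate = sum(y_train) / len(y_train)

# Same scheme as the cell above: each class is weighted by the other class's frequency
class_weight = {0: positive_rate, 1: 1 - positive_rate}
weights = sample_weights_from_class_weights(y_train, class_weight)

total_weight_no_sack = sum(w for w, label in zip(weights, y_train) if label == 0)
total_weight_sack = sum(w for w, label in zip(weights, y_train) if label == 1)
print(total_weight_no_sack, total_weight_sack)
```

Both totals come out (up to floating point error) to the same value, so the two classes contribute equally to the fit even though sacks are rare.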
In [41]:
model_performance_df = record_model_results(
    model_performance_df=model_performance_df,
    model_name="Naive Bayes",
    model=gauss_naive_bayes,
    x_test=x_test,
    y_test=y_test,
    desc=f"Logistic Regression model trained on {', '.join(lr_model.feature_names_in_)} features with a standard scaler applied.",
    standard_scaled=True
)
model_performance_df
Out[41]:
   model_id  model                         desc                                               accuracy  precision  recall    f1_score  standard_scaled  class_weighting
0  0         Base Logistic Regression      Logistic Regression model trained on yardline_...  0.936939  0.000000   0.000000  0.000000  False            False
1  1         Base Logistic Regression      Logistic Regression model trained on yardline_...  0.936939  0.000000   0.000000  0.000000  True             False
2  2         Base Logistic Regression      Logistic Regression model trained on yardline_...  0.686717  0.092227   0.448718  0.153005  False            True
3  3         Extended Logistic Regression  Logistic Regression model trained on yardline_...  0.936939  0.000000   0.000000  0.000000  False            False
4  4         Extended Logistic Regression  Logistic Regression model trained on yardline_...  0.612661  0.089121   0.557692  0.153683  False            True
5  5         Naive Bayes                   Logistic Regression model trained on yardline_...  0.688981  0.093560   0.452564  0.155063  True             True
In [42]:
gauss_naive_bayes.predict_proba(x_test)
Out[42]:
array([[0.30963618, 0.69036382],
       [0.6207606 , 0.3792394 ],
       [0.59460749, 0.40539251],
       ...,
       [0.58598141, 0.41401859],
       [0.32487958, 0.67512042],
       [0.05348466, 0.94651534]])
In [43]:
x_train, y_train, x_test, y_test = get_training_test_sets(prepared_extended_df)

gauss_naive_bayes = GaussianNB()
sample_weight = compute_sample_weight(
    class_weight={
        0: y_train.mean(),
        1: 1 - y_train.mean()
    },
    y=y_train,
)
gauss_naive_bayes.fit(x_train, y_train, sample_weight=sample_weight)

# Evaluate on the test set; the tuple is (precision, recall)
nb_predictions = gauss_naive_bayes.predict(x_test)
(
    precision_score(y_test, nb_predictions, labels=[0, 1], zero_division=0),
    recall_score(y_test, nb_predictions, labels=[0, 1], zero_division=0),
)
Out[43]:
(0.07611408199643493, 0.5474358974358975)
In [44]:
model_performance_df = record_model_results(
    model_performance_df=model_performance_df,
    model_name="Extended Naive Bayes",
    model=gauss_naive_bayes,
    x_test=x_test,
    y_test=y_test,
    desc=f"Logistic Regression model trained on {', '.join(lr_model.feature_names_in_)} features with a standard scaler applied.",
    standard_scaled=True
)
model_performance_df
Out[44]:
   model_id  model                         desc                                               accuracy  precision  recall    f1_score  standard_scaled  class_weighting
0  0         Base Logistic Regression      Logistic Regression model trained on yardline_...  0.936939  0.000000   0.000000  0.000000  False            False
1  1         Base Logistic Regression      Logistic Regression model trained on yardline_...  0.936939  0.000000   0.000000  0.000000  True             False
2  2         Base Logistic Regression      Logistic Regression model trained on yardline_...  0.686717  0.092227   0.448718  0.153005  False            True
3  3         Extended Logistic Regression  Logistic Regression model trained on yardline_...  0.936939  0.000000   0.000000  0.000000  False            False
4  4         Extended Logistic Regression  Logistic Regression model trained on yardline_...  0.612661  0.089121   0.557692  0.153683  False            True
5  5         Naive Bayes                   Logistic Regression model trained on yardline_...  0.688981  0.093560   0.452564  0.155063  True             True
6  6         Extended Naive Bayes          Logistic Regression model trained on yardline_...  0.552429  0.076114   0.547436  0.133646  True             True
In [45]:
gauss_naive_bayes.predict_proba(x_test)[:10]
Out[45]:
array([[1.35279132e-01, 8.64720868e-01],
       [9.99788922e-01, 2.11078093e-04],
       [9.99862102e-01, 1.37898363e-04],
       [1.13602332e-02, 9.88639767e-01],
       [1.71591152e-01, 8.28408848e-01],
       [5.23130132e-01, 4.76869868e-01],
       [9.93957091e-01, 6.04290909e-03],
       [1.59092952e-03, 9.98409070e-01],
       [9.99832936e-01, 1.67063909e-04],
       [6.78570265e-02, 9.32142973e-01]])
In [46]:
lr_model_5.decision_function(x_test)
Out[46]:
array([ 0.53337741, -0.77563037, -0.65725411, ..., -0.08239654,
        0.57091274, -0.3750084 ])
In [47]:
display = PrecisionRecallDisplay.from_estimator(
    gauss_naive_bayes, x_test, y_test, name="Gauss Naive Bayes", plot_chance_level=True, despine=True
)
display.ax_.set_xlabel("Recall")
# display.ax_.set_ylabel("Precision")
_ = display.ax_.set_title("2-class Precision-Recall curve")
plt.legend(loc="upper right")
plt.show()
No description has been provided for this image
In [48]:
# precision, recall, thresholds = precision_recall_curve(y_test, lr_model.predict_proba(x_test)[:, 1], pos_label=1)
fig, ax = plt.subplots(figsize=(5, 5), dpi=160)

# Label each curve with its own model name instead of reusing one label for all three
models = {
    "lr_model_4": lr_model_4,
    "lr_model_5": lr_model_5,
    "Gauss Naive Bayes": gauss_naive_bayes,
}
for name, model in models.items():
    PrecisionRecallDisplay.from_estimator(
        model, x_test, y_test, name=name, plot_chance_level=True, despine=True, ax=ax
    )

ax.set_xlabel("Recall")
ax.set_ylabel("Precision")
ax.set_title("Precision-Recall Curves for the Logistic Regression and Naive Bayes Models")
ax.legend(loc="upper right")
plt.show()
No description has been provided for this image
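For intuition about what the curves above trace out, here is a hand-rolled sketch that computes precision and recall at each score threshold. The labels and scores are made up, and the helper is only illustrative; it is not how scikit-learn computes the curve internally.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when predicting a sack for scores >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels and predicted sack probabilities
y_true = [0, 0, 1, 0, 1, 0, 0, 1]
scores = [0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.05, 0.80]

# One (threshold, precision, recall) point per distinct score, lowest threshold first
curve = [(t, *precision_recall_at(y_true, scores, t)) for t in sorted(set(scores))]
for threshold, precision, recall in curve:
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
```

At the lowest threshold every play is predicted to be a sack, so recall is 1 and precision equals the share of sacks in the data, which is the chance level drawn in the plots.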

Next steps¶

I want to incorporate information about the individual players who are likely on the field at the time of the play, such as how many times the quarterback was sacked in the previous year and how many total sacks the team had in the previous year. We could also be more specific about the player-level sack totals and gauge the likelihood of a sack based in part on that information. The provided datasets are:

  • advstats_week_def_{year}
  • advstats_week_pass_{year}
  • advstats_week_rush_{year}
  • depth_charts_{year}
  • play_by_play_{year}
  • players
  • roster_{year}
  • snap_counts_{year}
In [49]:
def load_def_advstats():
    """
    Load defensive advanced statistics from the per-season CSV files (2021-2023).

    :return: DataFrame containing defensive advanced statistics.
    """
    # Read each season's file and stack them into a single DataFrame
    return pd.concat(
        [
            pd.read_csv(f"../data/advstats_week_def_{year}.csv", header=0)
            for year in range(2021, 2024)
        ],
        ignore_index=True,
    )


def load_passer_advstats():
    """
    Load passer advanced statistics from the per-season CSV files (2021-2023).

    :return: DataFrame containing passer advanced statistics.
    """
    # Read each season's file and stack them into a single DataFrame
    return pd.concat(
        [
            pd.read_csv(f"../data/advstats_week_pass_{year}.csv", header=0)
            for year in range(2021, 2024)
        ],
        ignore_index=True,
    )
In [50]:
def estimate_unknown_season_sack_related_data(
    df: pd.DataFrame,
    fields: list[str],
    team_type: str,
    prev_season: int = 2020
) -> pd.DataFrame:
    """
    Estimate the previous season's sack related data for a year whose data is not available.

    :param df: DataFrame containing sack related data.
    :param fields: List of fields to estimate.
    :param team_type: Type of team to filter by (e.g., "posteam" or "defteam").
    :param prev_season: The season to use for estimation (default is 2020).

    :return: DataFrame with estimated previous season's sack related data.
    """
    temp_df = df.copy()
    temp_df.drop(columns=["prev_season"], inplace=True)
    temp_df = temp_df.groupby(team_type, as_index=False).mean()
    for field in fields:
        temp_df[field] = temp_df[field].astype(int)  # Cast to integers because they are counts
    temp_df["prev_season"] = prev_season
    return temp_df


def process_advstats(
    advstats_df: pd.DataFrame,
    team_type: str,
    fields: list[str],
) -> pd.DataFrame:
    """
    Process advanced statistics DataFrame by selecting relevant fields.

    :param advstats_df: DataFrame containing passer or def advanced statistics.
    :param team_type: Type of team to filter by (e.g., "posteam" or "defteam").
    :param fields: List of fields to select from the DataFrame.

    :return: Processed DataFrame with selected fields.
    """
    # Named aggregations: season totals per team, prefixed with "prev_szn".
    # Offensive fields get an extra "posteam" tag; the defensive fields already start with "def_".
    agg_fields_with_prefix = {
        f"prev_szn{('_' + team_type) if team_type == 'posteam' else ''}_{field}": (field, "sum")
        for field in fields
    }

    advstats_df = advstats_df.astype({"season": int})
    prev_season_advstats_df = advstats_df.groupby(
        by=["team", "season"],
        as_index=False
    ).agg(**agg_fields_with_prefix)
    prev_season_advstats_df.rename(
        columns={"team": team_type, "season": "prev_season"},
        inplace=True,
    )
    estimated_2020_stats = estimate_unknown_season_sack_related_data(
        prev_season_advstats_df,
        prev_season_advstats_df.columns.difference([team_type, "prev_season"]),
        team_type=team_type,
        prev_season=2020
    )
    prev_season_advstats_df = pd.concat(
        [prev_season_advstats_df, estimated_2020_stats],
        ignore_index=True
    )
    return prev_season_advstats_df


def enrich_passing_plays_data_with_prev_szn_stats(
    passing_plays_df: pd.DataFrame,
    fields: list[str],
) -> pd.DataFrame:
    """
    Prepare the passing plays DataFrame for training by merging it with advanced statistics
    from the previous season. This includes relevant sack-related statistics for both
    the offensive and defensive teams.

    :param passing_plays_df: DataFrame containing passing plays data.
    :param fields: List of fields to use as predictors.
    
    :return: DataFrame ready for training with advanced statistics merged.
    """
    advstats_passer_df = load_passer_advstats()
    advstats_def_df = load_def_advstats()

    passer_sack_relevant_fields = ["times_sacked", "times_blitzed", "times_hurried", "times_hit", "times_pressured"]
    posteam_advstats_df = process_advstats(         
        advstats_df=advstats_passer_df,
        team_type="posteam",
        fields=passer_sack_relevant_fields,
    )
    def_sack_relevant_fields = ["def_times_blitzed", "def_times_hurried", "def_times_hitqb", "def_sacks", "def_pressures"]
    defteam_advstats_df = process_advstats(
        advstats_df=advstats_def_df,
        team_type="defteam",
        fields=def_sack_relevant_fields,
    )

    # Keep the requested fields plus game_id, whose first four characters are the season
    augmented_fields = fields + ["game_id"]
    passing_plays_with_minimal_metadata = passing_plays_df[augmented_fields].copy()
    passing_plays_with_minimal_metadata["prev_season"] = (
        passing_plays_with_minimal_metadata["game_id"].str[:4].astype(int) - 1
    )
    passing_plays_with_minimal_metadata.drop(columns=["game_id"], inplace=True)

    # Merge the passing plays DataFrame with the advanced statistics DataFrames
    passing_plays_with_minimal_metadata = passing_plays_with_minimal_metadata.merge(
        posteam_advstats_df,
        how="left",
        on=["posteam", "prev_season"],
    )
    passing_plays_with_minimal_metadata = passing_plays_with_minimal_metadata.merge(
        defteam_advstats_df,
        how="left",
        on=["defteam", "prev_season"],
    )
    passing_plays_with_minimal_metadata.drop(columns=["prev_season"], inplace=True)
    return passing_plays_with_minimal_metadata
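The enrichment logic above boils down to two steps: total the weekly stats per team and season, then look up the season before each game, falling back to the team's average over the known seasons when that season is missing (as happens for 2020). Below is a dependency-free sketch of that idea. The rows, numbers, and the `prev_season_sacks` helper are made up for illustration and are not part of the notebook's code.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical weekly rows: (team, season, sacks recorded that week)
weekly_rows = [
    ("KC", 2021, 2), ("KC", 2021, 3),
    ("KC", 2022, 4), ("KC", 2022, 4),
    ("BUF", 2021, 1), ("BUF", 2021, 1),
    ("BUF", 2022, 5), ("BUF", 2022, 2),
]

# Step 1: sum the weekly values into one total per (team, season)
season_totals = defaultdict(int)
for team, season, sacks in weekly_rows:
    season_totals[(team, season)] += sacks


# Step 2: look up the previous season, with a fallback for seasons that have no data
def prev_season_sacks(team: str, game_season: int) -> int:
    prev = game_season - 1
    if (team, prev) in season_totals:
        return season_totals[(team, prev)]
    known = [total for (t, _), total in season_totals.items() if t == team]
    return int(mean(known))  # truncated to an integer count, like the astype(int) above


print(prev_season_sacks("KC", 2022))  # 5, the 2021 total
print(prev_season_sacks("KC", 2021))  # 6, since 2020 is missing: mean of 5 and 8, truncated
```

One thing this makes visible is that the fallback is built from seasons that come after the one being estimated, so it is only a rough stand-in.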
In [51]:
passing_training_data_df = enrich_passing_plays_data_with_prev_szn_stats(
    passing_plays_df=passing_plays_df,
    fields=extended_predictive_fields,
)
In [52]:
passing_training_data_df.drop(columns=["posteam", "defteam"], inplace=True)
In [53]:
prepared_enriched_df = prepare_data_for_training(
    passing_training_data_df,
    passing_training_data_df.columns,
    fields_to_encode=["down", "qtr"], # , "defteam", "posteam"],
    do_standard_scale=True,
    label_field="sack",
)
x_train, y_train, x_test, y_test = get_training_test_sets(prepared_enriched_df)
lr_model_6 = LogisticRegression(
    penalty="l2",
    solver="lbfgs",
    max_iter=1000,
    random_state=42,
    class_weight={1.0: 1 - y_train.mean(), 0.0: y_train.mean()},
)
lr_model_6.fit(x_train, y_train)
Out[53]:
LogisticRegression(class_weight={0.0: np.float64(0.06771507115135834),
                                 1.0: np.float64(0.9322849288486417)},
                   max_iter=1000, random_state=42)
In [54]:
model_performance_df
Out[54]:
   model_id  model                         desc                                               accuracy  precision  recall    f1_score  standard_scaled  class_weighting
0  0         Base Logistic Regression      Logistic Regression model trained on yardline_...  0.936939  0.000000   0.000000  0.000000  False            False
1  1         Base Logistic Regression      Logistic Regression model trained on yardline_...  0.936939  0.000000   0.000000  0.000000  True             False
2  2         Base Logistic Regression      Logistic Regression model trained on yardline_...  0.686717  0.092227   0.448718  0.153005  False            True
3  3         Extended Logistic Regression  Logistic Regression model trained on yardline_...  0.936939  0.000000   0.000000  0.000000  False            False
4  4         Extended Logistic Regression  Logistic Regression model trained on yardline_...  0.612661  0.089121   0.557692  0.153683  False            True
5  5         Naive Bayes                   Logistic Regression model trained on yardline_...  0.688981  0.093560   0.452564  0.155063  True             True
6  6         Extended Naive Bayes          Logistic Regression model trained on yardline_...  0.552429  0.076114   0.547436  0.133646  True             True
In [55]:
model_performance_df = record_model_results(
    model_performance_df=model_performance_df,
    model_name="Enriched Logistic Regression with Previous Season Stats",
    model=lr_model_6,
    x_test=x_test,
    y_test=y_test,
    desc=f"Logistic Regression model trained on {', '.join(lr_model_6.feature_names_in_)} features with a standard scaler applied.",
    standard_scaled=True
)
model_performance_df
Out[55]:
   model_id  model                                              desc                                               accuracy  precision  recall    f1_score  standard_scaled  class_weighting
0  0         Base Logistic Regression                           Logistic Regression model trained on yardline_...  0.936939  0.000000   0.000000  0.000000  False            False
1  1         Base Logistic Regression                           Logistic Regression model trained on yardline_...  0.936939  0.000000   0.000000  0.000000  True             False
2  2         Base Logistic Regression                           Logistic Regression model trained on yardline_...  0.686717  0.092227   0.448718  0.153005  False            True
3  3         Extended Logistic Regression                       Logistic Regression model trained on yardline_...  0.936939  0.000000   0.000000  0.000000  False            False
4  4         Extended Logistic Regression                       Logistic Regression model trained on yardline_...  0.612661  0.089121   0.557692  0.153683  False            True
5  5         Naive Bayes                                        Logistic Regression model trained on yardline_...  0.688981  0.093560   0.452564  0.155063  True             True
6  6         Extended Naive Bayes                               Logistic Regression model trained on yardline_...  0.552429  0.076114   0.547436  0.133646  True             True
7  7         Enriched Logistic Regression with Previous Sea...  Logistic Regression model trained on yardline_...  0.646940  0.090057   0.505128  0.152861  True             True
In [56]:
rf_model = RandomForestClassifier(
    n_estimators=200,
    max_depth=5,
    random_state=42,
    class_weight={1.0: 1 - y_train.mean(), 0.0: y_train.mean()},
)
rf_model.fit(x_train, y_train)
model_performance_df = record_model_results(
    model_performance_df=model_performance_df,
    model_name="Random Forest Classifier with Previous Season Stats",
    model=rf_model,
    x_test=x_test,
    y_test=y_test,
    desc=f"Random Forest Classifier model trained on {', '.join(rf_model.feature_names_in_)} features with a standard scaler applied.",
    standard_scaled=True
)
model_performance_df
Out[56]:
   model_id  model                                              desc                                               accuracy  precision  recall    f1_score  standard_scaled  class_weighting
0  0         Base Logistic Regression                           Logistic Regression model trained on yardline_...  0.936939  0.000000   0.000000  0.000000  False            False
1  1         Base Logistic Regression                           Logistic Regression model trained on yardline_...  0.936939  0.000000   0.000000  0.000000  True             False
2  2         Base Logistic Regression                           Logistic Regression model trained on yardline_...  0.686717  0.092227   0.448718  0.153005  False            True
3  3         Extended Logistic Regression                       Logistic Regression model trained on yardline_...  0.936939  0.000000   0.000000  0.000000  False            False
4  4         Extended Logistic Regression                       Logistic Regression model trained on yardline_...  0.612661  0.089121   0.557692  0.153683  False            True
5  5         Naive Bayes                                        Logistic Regression model trained on yardline_...  0.688981  0.093560   0.452564  0.155063  True             True
6  6         Extended Naive Bayes                               Logistic Regression model trained on yardline_...  0.552429  0.076114   0.547436  0.133646  True             True
7  7         Enriched Logistic Regression with Previous Sea...  Logistic Regression model trained on yardline_...  0.646940  0.090057   0.505128  0.152861  True             True
8  8         Random Forest Classifier with Previous Season ...  Random Forest Classifier model trained on yard...  0.699733  0.097862   0.457692  0.161247  True             True
In [57]:
xgb_model = XGBClassifier(
    n_estimators=200,
    max_depth=5,
    random_state=42,
    eval_metric="logloss",
    scale_pos_weight=((len(y_train) - y_train.sum()) / y_train.sum())
)
xgb_model.fit(x_train, y_train)
model_performance_df = record_model_results(
    model_performance_df=model_performance_df,
    model_name="XGBoost Classifier with Previous Season Stats",
    model=xgb_model,
    x_test=x_test,
    y_test=y_test,
    desc=f"XGBoost Classifier model trained on {', '.join(xgb_model.feature_names_in_)} features with a standard scaler applied.",
    standard_scaled=True
)
model_performance_df
Out[57]:
   model_id  model                                              desc                                               accuracy  precision  recall    f1_score  standard_scaled  class_weighting
0  0         Base Logistic Regression                           Logistic Regression model trained on yardline_...  0.936939  0.000000   0.000000  0.000000  False            False
1  1         Base Logistic Regression                           Logistic Regression model trained on yardline_...  0.936939  0.000000   0.000000  0.000000  True             False
2  2         Base Logistic Regression                           Logistic Regression model trained on yardline_...  0.686717  0.092227   0.448718  0.153005  False            True
3  3         Extended Logistic Regression                       Logistic Regression model trained on yardline_...  0.936939  0.000000   0.000000  0.000000  False            False
4  4         Extended Logistic Regression                       Logistic Regression model trained on yardline_...  0.612661  0.089121   0.557692  0.153683  False            True
5  5         Naive Bayes                                        Logistic Regression model trained on yardline_...  0.688981  0.093560   0.452564  0.155063  True             True
6  6         Extended Naive Bayes                               Logistic Regression model trained on yardline_...  0.552429  0.076114   0.547436  0.133646  True             True
7  7         Enriched Logistic Regression with Previous Sea...  Logistic Regression model trained on yardline_...  0.646940  0.090057   0.505128  0.152861  True             True
8  8         Random Forest Classifier with Previous Season ...  Random Forest Classifier model trained on yard...  0.699733  0.097862   0.457692  0.161247  True             True
9  9         XGBoost Classifier with Previous Season Stats      XGBoost Classifier model trained on yardline_1...  0.771687  0.096367   0.312821  0.147343  True             True
In [58]:
rf_model.predict_proba(x_test)[:10]
Out[58]:
array([[0.43015753, 0.56984247],
       [0.59379433, 0.40620567],
       [0.57414669, 0.42585331],
       [0.53027696, 0.46972304],
       [0.52337722, 0.47662278],
       [0.55743929, 0.44256071],
       [0.56855249, 0.43144751],
       [0.50996868, 0.49003132],
       [0.61490568, 0.38509432],
       [0.40005522, 0.59994478]])
  • The initial model has a problem: it can reach very high accuracy by always predicting that there is no sack.
  • The underlying issue is the class imbalance.
  • Consider constructing the probability directly from the empirical distribution (the average sack rate given a set of constraints).
    • Can I use Naive Bayes or a Decision Tree for this?
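As a rough illustration of the empirical-distribution idea, the sketch below conditions on a single constraint (the down) and takes the observed sack rate within each group as the probability. The plays are made up, and a real version would condition on more fields.

```python
from collections import defaultdict

# Hypothetical pass plays as (down, sack) pairs
plays = [(1, 0), (1, 0), (1, 0), (1, 1), (3, 0), (3, 1), (3, 1), (3, 0), (2, 0), (2, 0)]

counts = defaultdict(lambda: [0, 0])  # down -> [number of sacks, number of plays]
for down, sack in plays:
    counts[down][0] += sack
    counts[down][1] += 1

# Empirical probability of a sack given the down
empirical_sack_rate = {down: sacks / total for down, (sacks, total) in counts.items()}
print(empirical_sack_rate)  # {1: 0.25, 3: 0.5, 2: 0.0}
```

With many constraints the groups get small and the rates get noisy, which is one reason a model such as Naive Bayes or a shallow decision tree may be the more practical route.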

Important ideas for next steps¶

  • Err on the side of overestimating the probability of something happening (this results in a lower payout).
  • Optimize for, or at least compare model performance with, precision, recall, or F1 score. Accuracy is going to be misleading due to the class imbalance.
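To see why overestimating lowers the payout, here is a minimal sketch that assumes the payout is quoted as fair decimal odds, the reciprocal of the probability. That pricing rule is my assumption for illustration; it is not taken from the assessment materials.

```python
def fair_decimal_odds(probability: float) -> float:
    """Decimal odds at which a bet breaks even for the given probability."""
    if not 0 < probability <= 1:
        raise ValueError("probability must be in (0, 1]")
    return 1 / probability


# Hypothetical estimates for the same play: a lower and a higher sack probability
print(fair_decimal_odds(0.06))
print(fair_decimal_odds(0.08))
```

The higher estimate produces the smaller odds, so an overestimate errs toward paying out less.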
In [ ]: